Educational Process Mining

https://archive-beta.ics.uci.edu/dataset/346/educational+process+mining+epm+a+learning+analytics+data+set

Educational Process Mining (EPM): A Learning Analytics Data Set. (2015). UCI Machine Learning Repository.

Poisson Regression to model final questions

The final exam consisted of 16 questions, each focused on content from a specific session. Each session has two or more related questions, as follows:

  • Session 1: two questions
  • Session 2: two questions
  • Session 3: five questions
  • Session 4: two questions
  • Session 5: three questions
  • Session 6: two questions

The questions correspond to trials, and the number of questions answered correctly corresponds to events. Student participation (the input data) was recorded by activity, exercise, and session. The numbers of trials and events for each student and session are counts, which can be modeled with Poisson or Negative Binomial Regression. This notebook uses both to model the expected number of correctly answered final-exam questions.

Both the original values and their principal components were used with the Poisson and Negative Binomial Regression methods, giving four approaches. Using AIC as the performance metric, the best model for all four was the purely additive model with the categorical features sid and actv_grp plus the numeric features (either the interpolated variables or the principal components):

final_events ~ sid + actv_grp + numeric_features

Because the student ID, the activity group, and the numeric features all appear in the AIC-preferred model, the results suggest that both the individual student and the student's behavior, as measured by mouse and keyboard activity, contribute to explaining the outcome.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
In [2]:
import statsmodels.formula.api as smf
import statsmodels.api as sm
In [3]:
from patsy import dmatrices

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

Include functions

CMPINF2120_EPM_FUNC_INCL_Over_Lisa.ipynb includes functions used in this notebook.

In [4]:
%run CMPINF2120_EPM_FUNC_INCL_Over_Lisa.ipynb
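
The included notebook is not reproduced here, so the helpers it defines (such as get_var_list and get_var_list_b, used later) cannot be inspected. As a purely hypothetical sketch of what a column-selection helper of this kind might do, the function below returns the column names containing any of a list of substrings. The name select_columns_containing and its behavior are assumptions, not the author's implementation.

```python
import pandas as pd


def select_columns_containing(df, substrings):
    """Return the column names that contain any of the given substrings.

    Hypothetical helper for illustration; not the notebook's get_var_list.
    """
    return [col for col in df.columns if any(s in col for s in substrings)]


# Empty frame with a few column names in the style used by this notebook.
demo = pd.DataFrame(
    columns=["sid", "mw_tp000_sqrt", "mwc_tp000_sqrt", "ks_tp000_sqrt", "final_events"]
)

sqrt_cols = select_columns_containing(demo, ["sqrt"])  # all transformed inputs
mw_cols = select_columns_containing(demo, ["mw_"])     # trailing underscore excludes mwc_
ks_cols = select_columns_containing(demo, ["ks"])
print(sqrt_cols, mw_cols, ks_cols)
```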

Load the data from GitHub repository

In [5]:
inputs_final_sqrt_path = 'https://raw.githubusercontent.com/lisaover/CMPINF2120_project/main/tp_sqrt_inputs_final_df.csv'
final_path = 'https://raw.githubusercontent.com/lisaover/CMPINF2120_project/main/final_df.csv'
pts_path = 'https://raw.githubusercontent.com/lisaover/CMPINF2120_project/main/final_points_lookup.csv'
In [6]:
final_sqrt_init = pd.read_csv(inputs_final_sqrt_path)
final_init = pd.read_csv(final_path)
pts_final_lookup = pd.read_csv(pts_path)
In [7]:
final_sqrt_init.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444 entries, 0 to 2443
Data columns (total 82 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   sess                 2444 non-null   int64  
 1   sid                  2444 non-null   int64  
 2   actv_grp             2444 non-null   object 
 3   total_ms_tp000_sqrt  2444 non-null   float64
 4   mw_tp000_sqrt        2444 non-null   float64
 5   mwc_tp000_sqrt       2444 non-null   float64
 6   mcl_tp000_sqrt       2444 non-null   float64
 7   mcr_tp000_sqrt       2444 non-null   float64
 8   mm_tp000_sqrt        2444 non-null   float64
 9   ks_tp000_sqrt        2444 non-null   float64
 10  total_ms_tp010_sqrt  2444 non-null   float64
 11  mw_tp010_sqrt        2444 non-null   float64
 12  mwc_tp010_sqrt       2444 non-null   float64
 13  mcl_tp010_sqrt       2444 non-null   float64
 14  mcr_tp010_sqrt       2444 non-null   float64
 15  mm_tp010_sqrt        2444 non-null   float64
 16  ks_tp010_sqrt        2444 non-null   float64
 17  total_ms_tp020_sqrt  2444 non-null   float64
 18  mw_tp020_sqrt        2444 non-null   float64
 19  mwc_tp020_sqrt       2444 non-null   float64
 20  mcl_tp020_sqrt       2444 non-null   float64
 21  mcr_tp020_sqrt       2444 non-null   float64
 22  mm_tp020_sqrt        2444 non-null   float64
 23  ks_tp020_sqrt        2444 non-null   float64
 24  total_ms_tp030_sqrt  2444 non-null   float64
 25  mw_tp030_sqrt        2444 non-null   float64
 26  mwc_tp030_sqrt       2444 non-null   float64
 27  mcl_tp030_sqrt       2444 non-null   float64
 28  mcr_tp030_sqrt       2444 non-null   float64
 29  mm_tp030_sqrt        2444 non-null   float64
 30  ks_tp030_sqrt        2444 non-null   float64
 31  total_ms_tp040_sqrt  2444 non-null   float64
 32  mw_tp040_sqrt        2444 non-null   float64
 33  mwc_tp040_sqrt       2444 non-null   float64
 34  mcl_tp040_sqrt       2444 non-null   float64
 35  mcr_tp040_sqrt       2444 non-null   float64
 36  mm_tp040_sqrt        2444 non-null   float64
 37  ks_tp040_sqrt        2444 non-null   float64
 38  total_ms_tp050_sqrt  2444 non-null   float64
 39  mw_tp050_sqrt        2444 non-null   float64
 40  mwc_tp050_sqrt       2444 non-null   float64
 41  mcl_tp050_sqrt       2444 non-null   float64
 42  mcr_tp050_sqrt       2444 non-null   float64
 43  mm_tp050_sqrt        2444 non-null   float64
 44  ks_tp050_sqrt        2444 non-null   float64
 45  total_ms_tp060_sqrt  2444 non-null   float64
 46  mw_tp060_sqrt        2444 non-null   float64
 47  mwc_tp060_sqrt       2444 non-null   float64
 48  mcl_tp060_sqrt       2444 non-null   float64
 49  mcr_tp060_sqrt       2444 non-null   float64
 50  mm_tp060_sqrt        2444 non-null   float64
 51  ks_tp060_sqrt        2444 non-null   float64
 52  total_ms_tp070_sqrt  2444 non-null   float64
 53  mw_tp070_sqrt        2444 non-null   float64
 54  mwc_tp070_sqrt       2444 non-null   float64
 55  mcl_tp070_sqrt       2444 non-null   float64
 56  mcr_tp070_sqrt       2444 non-null   float64
 57  mm_tp070_sqrt        2444 non-null   float64
 58  ks_tp070_sqrt        2444 non-null   float64
 59  total_ms_tp080_sqrt  2444 non-null   float64
 60  mw_tp080_sqrt        2444 non-null   float64
 61  mwc_tp080_sqrt       2444 non-null   float64
 62  mcl_tp080_sqrt       2444 non-null   float64
 63  mcr_tp080_sqrt       2444 non-null   float64
 64  mm_tp080_sqrt        2444 non-null   float64
 65  ks_tp080_sqrt        2444 non-null   float64
 66  total_ms_tp090_sqrt  2444 non-null   float64
 67  mw_tp090_sqrt        2444 non-null   float64
 68  mwc_tp090_sqrt       2444 non-null   float64
 69  mcl_tp090_sqrt       2444 non-null   float64
 70  mcr_tp090_sqrt       2444 non-null   float64
 71  mm_tp090_sqrt        2444 non-null   float64
 72  ks_tp090_sqrt        2444 non-null   float64
 73  total_ms_tp100_sqrt  2444 non-null   float64
 74  mw_tp100_sqrt        2444 non-null   float64
 75  mwc_tp100_sqrt       2444 non-null   float64
 76  mcl_tp100_sqrt       2444 non-null   float64
 77  mcr_tp100_sqrt       2444 non-null   float64
 78  mm_tp100_sqrt        2444 non-null   float64
 79  ks_tp100_sqrt        2444 non-null   float64
 80  final_events         2444 non-null   float64
 81  final_trials         2444 non-null   float64
dtypes: float64(79), int64(2), object(1)
memory usage: 1.5+ MB
In [8]:
final_sqrt_init.isna().sum()
Out[8]:
sess                   0
sid                    0
actv_grp               0
total_ms_tp000_sqrt    0
mw_tp000_sqrt          0
                      ..
mcr_tp100_sqrt         0
mm_tp100_sqrt          0
ks_tp100_sqrt          0
final_events           0
final_trials           0
Length: 82, dtype: int64

Prepare final data for count visualization

In [9]:
final_init.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sid          62 non-null     int64  
 1   Es_1q1       62 non-null     float64
 2   Es_1q2       62 non-null     float64
 3   Es_2q1       62 non-null     float64
 4   Es_2q2       62 non-null     float64
 5   Es_3q1       62 non-null     float64
 6   Es_3q2       62 non-null     float64
 7   Es_3q3       62 non-null     float64
 8   Es_3q4       62 non-null     float64
 9   Es_3q5       62 non-null     float64
 10  Es_4q1       62 non-null     float64
 11  Es_4q2       62 non-null     float64
 12  Es_5q1       62 non-null     float64
 13  Es_5q2       62 non-null     float64
 14  Es_5q3       62 non-null     float64
 15  Es_6q1       62 non-null     float64
 16  Es_6q2       62 non-null     float64
 17  final_score  62 non-null     float64
dtypes: float64(17), int64(1)
memory usage: 8.8 KB
In [10]:
final_init.isna().sum()
Out[10]:
sid            0
Es_1q1         0
Es_1q2         0
Es_2q1         0
Es_2q2         0
Es_3q1         0
Es_3q2         0
Es_3q3         0
Es_3q4         0
Es_3q5         0
Es_4q1         0
Es_4q2         0
Es_5q1         0
Es_5q2         0
Es_5q3         0
Es_6q1         0
Es_6q2         0
final_score    0
dtype: int64
In [11]:
pts_final_lookup.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   question         17 non-null     object
 1   question_points  17 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 400.0+ bytes
In [12]:
pts_final_lookup.isna().sum()
Out[12]:
question           0
question_points    0
dtype: int64

Melt final_init and create a session variable

In [13]:
final_lf = (
    final_init.melt(id_vars=['sid'])
              .rename(columns={"variable": "question", "value": "quest_scr"})
)
In [14]:
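# The first digit in the question name is the session number.
# A raw string avoids an invalid-escape warning; expand=False returns a Series.
# Names without a digit (such as final_score) get NaN.
final_lf['sess'] = final_lf.question.str.extract(r'(\d)', expand=False)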

Merge final_lf with pts_final_lookup and create a pass/fail variable for each student and question

In [15]:
final_lf_b = pd.merge(final_lf, pts_final_lookup, on='question', how='left')
In [16]:
final_lf_b.head()
Out[16]:
   sid question  quest_scr sess  question_points
0    1   Es_1q1        2.0    1                2
1    2   Es_1q1        2.0    1                2
2    4   Es_1q1        2.0    1                2
3    5   Es_1q1        2.0    1                2
4    7   Es_1q1        2.0    1                2
In [17]:
# A question counts as passed when the student earned at least 70% of its points.
final_lf_b['Qpass'] = (final_lf_b['quest_scr'] / final_lf_b['question_points'] >= 0.7).astype(int)
In [18]:
final_lf_b.head()
Out[18]:
   sid question  quest_scr sess  question_points  Qpass
0    1   Es_1q1        2.0    1                2      1
1    2   Es_1q1        2.0    1                2      1
2    4   Es_1q1        2.0    1                2      1
3    5   Es_1q1        2.0    1                2      1
4    7   Es_1q1        2.0    1                2      1
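
The reshaping steps above (melt, extract the session digit, merge the point values, flag passes at 70%) can be followed end to end on a toy table. The sketch also aggregates the pass flags into per-student, per-session events and trials. Whether the final_events and final_trials columns in the modeling data were derived exactly this way is an assumption, and the toy scores and point values are invented.

```python
import pandas as pd

# Toy wide table: one row per student, one column per exam question.
wide = pd.DataFrame({
    "sid": [1, 2],
    "Es_1q1": [2.0, 1.0],
    "Es_1q2": [3.0, 3.0],
    "Es_2q1": [0.5, 4.0],
})
# Hypothetical maximum points per question.
points = pd.DataFrame({
    "question": ["Es_1q1", "Es_1q2", "Es_2q1"],
    "question_points": [2, 3, 4],
})

# Long format: one row per student and question.
long = wide.melt(id_vars=["sid"], var_name="question", value_name="quest_scr")
# Session number is the first digit in the question name.
long["sess"] = long["question"].str.extract(r"(\d)", expand=False)
long = long.merge(points, on="question", how="left")
# Pass flag: at least 70% of the available points.
long["Qpass"] = (long["quest_scr"] / long["question_points"] >= 0.7).astype(int)

# Events = questions passed, trials = questions asked, per student and session.
counts = (
    long.groupby(["sid", "sess"], as_index=False)
        .agg(final_events=("Qpass", "sum"), final_trials=("Qpass", "count"))
)
print(counts)
```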

Prepare input/final data for modeling

In [19]:
final_sqrt_init['sid'] = final_sqrt_init['sid'].astype('object')
final_sqrt_init['sess'] = final_sqrt_init['sess'].astype('object')
In [20]:
final_sqrt_df = final_sqrt_init.copy()
In [21]:
sqrt_vars = get_var_list(final_sqrt_df,['sqrt'])
In [22]:
sqrt_features_df = final_sqrt_df.loc[:, sqrt_vars].copy()
In [23]:
sqrt_features_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444 entries, 0 to 2443
Data columns (total 77 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   total_ms_tp000_sqrt  2444 non-null   float64
 1   mw_tp000_sqrt        2444 non-null   float64
 2   mwc_tp000_sqrt       2444 non-null   float64
 3   mcl_tp000_sqrt       2444 non-null   float64
 4   mcr_tp000_sqrt       2444 non-null   float64
 5   mm_tp000_sqrt        2444 non-null   float64
 6   ks_tp000_sqrt        2444 non-null   float64
 7   total_ms_tp010_sqrt  2444 non-null   float64
 8   mw_tp010_sqrt        2444 non-null   float64
 9   mwc_tp010_sqrt       2444 non-null   float64
 10  mcl_tp010_sqrt       2444 non-null   float64
 11  mcr_tp010_sqrt       2444 non-null   float64
 12  mm_tp010_sqrt        2444 non-null   float64
 13  ks_tp010_sqrt        2444 non-null   float64
 14  total_ms_tp020_sqrt  2444 non-null   float64
 15  mw_tp020_sqrt        2444 non-null   float64
 16  mwc_tp020_sqrt       2444 non-null   float64
 17  mcl_tp020_sqrt       2444 non-null   float64
 18  mcr_tp020_sqrt       2444 non-null   float64
 19  mm_tp020_sqrt        2444 non-null   float64
 20  ks_tp020_sqrt        2444 non-null   float64
 21  total_ms_tp030_sqrt  2444 non-null   float64
 22  mw_tp030_sqrt        2444 non-null   float64
 23  mwc_tp030_sqrt       2444 non-null   float64
 24  mcl_tp030_sqrt       2444 non-null   float64
 25  mcr_tp030_sqrt       2444 non-null   float64
 26  mm_tp030_sqrt        2444 non-null   float64
 27  ks_tp030_sqrt        2444 non-null   float64
 28  total_ms_tp040_sqrt  2444 non-null   float64
 29  mw_tp040_sqrt        2444 non-null   float64
 30  mwc_tp040_sqrt       2444 non-null   float64
 31  mcl_tp040_sqrt       2444 non-null   float64
 32  mcr_tp040_sqrt       2444 non-null   float64
 33  mm_tp040_sqrt        2444 non-null   float64
 34  ks_tp040_sqrt        2444 non-null   float64
 35  total_ms_tp050_sqrt  2444 non-null   float64
 36  mw_tp050_sqrt        2444 non-null   float64
 37  mwc_tp050_sqrt       2444 non-null   float64
 38  mcl_tp050_sqrt       2444 non-null   float64
 39  mcr_tp050_sqrt       2444 non-null   float64
 40  mm_tp050_sqrt        2444 non-null   float64
 41  ks_tp050_sqrt        2444 non-null   float64
 42  total_ms_tp060_sqrt  2444 non-null   float64
 43  mw_tp060_sqrt        2444 non-null   float64
 44  mwc_tp060_sqrt       2444 non-null   float64
 45  mcl_tp060_sqrt       2444 non-null   float64
 46  mcr_tp060_sqrt       2444 non-null   float64
 47  mm_tp060_sqrt        2444 non-null   float64
 48  ks_tp060_sqrt        2444 non-null   float64
 49  total_ms_tp070_sqrt  2444 non-null   float64
 50  mw_tp070_sqrt        2444 non-null   float64
 51  mwc_tp070_sqrt       2444 non-null   float64
 52  mcl_tp070_sqrt       2444 non-null   float64
 53  mcr_tp070_sqrt       2444 non-null   float64
 54  mm_tp070_sqrt        2444 non-null   float64
 55  ks_tp070_sqrt        2444 non-null   float64
 56  total_ms_tp080_sqrt  2444 non-null   float64
 57  mw_tp080_sqrt        2444 non-null   float64
 58  mwc_tp080_sqrt       2444 non-null   float64
 59  mcl_tp080_sqrt       2444 non-null   float64
 60  mcr_tp080_sqrt       2444 non-null   float64
 61  mm_tp080_sqrt        2444 non-null   float64
 62  ks_tp080_sqrt        2444 non-null   float64
 63  total_ms_tp090_sqrt  2444 non-null   float64
 64  mw_tp090_sqrt        2444 non-null   float64
 65  mwc_tp090_sqrt       2444 non-null   float64
 66  mcl_tp090_sqrt       2444 non-null   float64
 67  mcr_tp090_sqrt       2444 non-null   float64
 68  mm_tp090_sqrt        2444 non-null   float64
 69  ks_tp090_sqrt        2444 non-null   float64
 70  total_ms_tp100_sqrt  2444 non-null   float64
 71  mw_tp100_sqrt        2444 non-null   float64
 72  mwc_tp100_sqrt       2444 non-null   float64
 73  mcl_tp100_sqrt       2444 non-null   float64
 74  mcr_tp100_sqrt       2444 non-null   float64
 75  mm_tp100_sqrt        2444 non-null   float64
 76  ks_tp100_sqrt        2444 non-null   float64
dtypes: float64(77)
memory usage: 1.4 MB
In [24]:
sqrt_feature_names = sqrt_features_df.columns
In [25]:
len(sqrt_feature_names)
Out[25]:
77

Visualizations

In [26]:
final_sqrt_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444 entries, 0 to 2443
Data columns (total 82 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   sess                 2444 non-null   object 
 1   sid                  2444 non-null   object 
 2   actv_grp             2444 non-null   object 
 3   total_ms_tp000_sqrt  2444 non-null   float64
 4   mw_tp000_sqrt        2444 non-null   float64
 5   mwc_tp000_sqrt       2444 non-null   float64
 6   mcl_tp000_sqrt       2444 non-null   float64
 7   mcr_tp000_sqrt       2444 non-null   float64
 8   mm_tp000_sqrt        2444 non-null   float64
 9   ks_tp000_sqrt        2444 non-null   float64
 10  total_ms_tp010_sqrt  2444 non-null   float64
 11  mw_tp010_sqrt        2444 non-null   float64
 12  mwc_tp010_sqrt       2444 non-null   float64
 13  mcl_tp010_sqrt       2444 non-null   float64
 14  mcr_tp010_sqrt       2444 non-null   float64
 15  mm_tp010_sqrt        2444 non-null   float64
 16  ks_tp010_sqrt        2444 non-null   float64
 17  total_ms_tp020_sqrt  2444 non-null   float64
 18  mw_tp020_sqrt        2444 non-null   float64
 19  mwc_tp020_sqrt       2444 non-null   float64
 20  mcl_tp020_sqrt       2444 non-null   float64
 21  mcr_tp020_sqrt       2444 non-null   float64
 22  mm_tp020_sqrt        2444 non-null   float64
 23  ks_tp020_sqrt        2444 non-null   float64
 24  total_ms_tp030_sqrt  2444 non-null   float64
 25  mw_tp030_sqrt        2444 non-null   float64
 26  mwc_tp030_sqrt       2444 non-null   float64
 27  mcl_tp030_sqrt       2444 non-null   float64
 28  mcr_tp030_sqrt       2444 non-null   float64
 29  mm_tp030_sqrt        2444 non-null   float64
 30  ks_tp030_sqrt        2444 non-null   float64
 31  total_ms_tp040_sqrt  2444 non-null   float64
 32  mw_tp040_sqrt        2444 non-null   float64
 33  mwc_tp040_sqrt       2444 non-null   float64
 34  mcl_tp040_sqrt       2444 non-null   float64
 35  mcr_tp040_sqrt       2444 non-null   float64
 36  mm_tp040_sqrt        2444 non-null   float64
 37  ks_tp040_sqrt        2444 non-null   float64
 38  total_ms_tp050_sqrt  2444 non-null   float64
 39  mw_tp050_sqrt        2444 non-null   float64
 40  mwc_tp050_sqrt       2444 non-null   float64
 41  mcl_tp050_sqrt       2444 non-null   float64
 42  mcr_tp050_sqrt       2444 non-null   float64
 43  mm_tp050_sqrt        2444 non-null   float64
 44  ks_tp050_sqrt        2444 non-null   float64
 45  total_ms_tp060_sqrt  2444 non-null   float64
 46  mw_tp060_sqrt        2444 non-null   float64
 47  mwc_tp060_sqrt       2444 non-null   float64
 48  mcl_tp060_sqrt       2444 non-null   float64
 49  mcr_tp060_sqrt       2444 non-null   float64
 50  mm_tp060_sqrt        2444 non-null   float64
 51  ks_tp060_sqrt        2444 non-null   float64
 52  total_ms_tp070_sqrt  2444 non-null   float64
 53  mw_tp070_sqrt        2444 non-null   float64
 54  mwc_tp070_sqrt       2444 non-null   float64
 55  mcl_tp070_sqrt       2444 non-null   float64
 56  mcr_tp070_sqrt       2444 non-null   float64
 57  mm_tp070_sqrt        2444 non-null   float64
 58  ks_tp070_sqrt        2444 non-null   float64
 59  total_ms_tp080_sqrt  2444 non-null   float64
 60  mw_tp080_sqrt        2444 non-null   float64
 61  mwc_tp080_sqrt       2444 non-null   float64
 62  mcl_tp080_sqrt       2444 non-null   float64
 63  mcr_tp080_sqrt       2444 non-null   float64
 64  mm_tp080_sqrt        2444 non-null   float64
 65  ks_tp080_sqrt        2444 non-null   float64
 66  total_ms_tp090_sqrt  2444 non-null   float64
 67  mw_tp090_sqrt        2444 non-null   float64
 68  mwc_tp090_sqrt       2444 non-null   float64
 69  mcl_tp090_sqrt       2444 non-null   float64
 70  mcr_tp090_sqrt       2444 non-null   float64
 71  mm_tp090_sqrt        2444 non-null   float64
 72  ks_tp090_sqrt        2444 non-null   float64
 73  total_ms_tp100_sqrt  2444 non-null   float64
 74  mw_tp100_sqrt        2444 non-null   float64
 75  mwc_tp100_sqrt       2444 non-null   float64
 76  mcl_tp100_sqrt       2444 non-null   float64
 77  mcr_tp100_sqrt       2444 non-null   float64
 78  mm_tp100_sqrt        2444 non-null   float64
 79  ks_tp100_sqrt        2444 non-null   float64
 80  final_events         2444 non-null   float64
 81  final_trials         2444 non-null   float64
dtypes: float64(79), object(3)
memory usage: 1.5+ MB
In [27]:
sns.displot( data = final_lf_b.loc[final_lf_b.Qpass==1], x='sess', col='sid', col_wrap=2, 
            hue='sess', kind='hist', binwidth = 1, facet_kws={'sharey':False, 'sharex':False})

plt.show()

Output to input relationships via scatter plots

In [28]:
final_sqrt_lf = final_sqrt_df.melt(id_vars=['sess','sid','actv_grp','final_events','final_trials'], ignore_index=True).copy()
In [29]:
final_sqrt_lf.head()
Out[29]:
  sess sid actv_grp  final_events  final_trials             variable       value
0    1   1  Aulaweb           2.0           2.0  total_ms_tp000_sqrt   89.442719
1    1   1    Blank           2.0           2.0  total_ms_tp000_sqrt   89.442719
2    1   1    Deeds           2.0           2.0  total_ms_tp000_sqrt  202.484567
3    1   1  Diagram           2.0           2.0  total_ms_tp000_sqrt  939.148551
4    1   1    Other           2.0           2.0  total_ms_tp000_sqrt   31.622777
In [30]:
final_sqrt_lf.actv_grp.unique()
Out[30]:
array(['Aulaweb', 'Blank', 'Deeds', 'Diagram', 'Other', 'Properties',
       'Study', 'TextEditor', 'Study_Materials', 'FSM_Related', 'FSM'],
      dtype=object)
In [31]:
actv_subgrp_1 = ['Aulaweb','Deeds','Diagram','TextEditor','FSM_Related','FSM']
In [32]:
actv_subgrp_2 = ['Blank','Other','Properties','Study','Study_Materials']
In [33]:
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==1)], 
            x='final_events', y='value', 
            col='actv_grp', row='variable', hue='actv_grp',
            facet_kws={'sharey': False, 'sharex': False})

plt.show()
In [34]:
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==2)], 
            x='final_events', y='value', 
            col='actv_grp', row='variable', hue='actv_grp',
            facet_kws={'sharey': False, 'sharex': False})

plt.show()
In [35]:
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==3)], 
            x='final_events', y='value', 
            col='actv_grp', row='variable', hue='actv_grp',
            facet_kws={'sharey': False, 'sharex': False})

plt.show()
In [36]:
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==4)], 
            x='final_events', y='value', 
            col='actv_grp', row='variable', hue='actv_grp',
            facet_kws={'sharey': False, 'sharex': False})

plt.show()
In [37]:
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==5)], 
            x='final_events', y='value', 
            col='actv_grp', row='variable', hue='actv_grp',
            facet_kws={'sharey': False, 'sharex': False})

plt.show()
In [38]:
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==6)], 
            x='final_events', y='value', 
            col='actv_grp', row='variable', hue='actv_grp',
            facet_kws={'sharey': False, 'sharex': False})

plt.show()
In [39]:
final_sqrt_df.columns
Out[39]:
Index(['sess', 'sid', 'actv_grp', 'total_ms_tp000_sqrt', 'mw_tp000_sqrt',
       'mwc_tp000_sqrt', 'mcl_tp000_sqrt', 'mcr_tp000_sqrt', 'mm_tp000_sqrt',
       'ks_tp000_sqrt', 'total_ms_tp010_sqrt', 'mw_tp010_sqrt',
       'mwc_tp010_sqrt', 'mcl_tp010_sqrt', 'mcr_tp010_sqrt', 'mm_tp010_sqrt',
       'ks_tp010_sqrt', 'total_ms_tp020_sqrt', 'mw_tp020_sqrt',
       'mwc_tp020_sqrt', 'mcl_tp020_sqrt', 'mcr_tp020_sqrt', 'mm_tp020_sqrt',
       'ks_tp020_sqrt', 'total_ms_tp030_sqrt', 'mw_tp030_sqrt',
       'mwc_tp030_sqrt', 'mcl_tp030_sqrt', 'mcr_tp030_sqrt', 'mm_tp030_sqrt',
       'ks_tp030_sqrt', 'total_ms_tp040_sqrt', 'mw_tp040_sqrt',
       'mwc_tp040_sqrt', 'mcl_tp040_sqrt', 'mcr_tp040_sqrt', 'mm_tp040_sqrt',
       'ks_tp040_sqrt', 'total_ms_tp050_sqrt', 'mw_tp050_sqrt',
       'mwc_tp050_sqrt', 'mcl_tp050_sqrt', 'mcr_tp050_sqrt', 'mm_tp050_sqrt',
       'ks_tp050_sqrt', 'total_ms_tp060_sqrt', 'mw_tp060_sqrt',
       'mwc_tp060_sqrt', 'mcl_tp060_sqrt', 'mcr_tp060_sqrt', 'mm_tp060_sqrt',
       'ks_tp060_sqrt', 'total_ms_tp070_sqrt', 'mw_tp070_sqrt',
       'mwc_tp070_sqrt', 'mcl_tp070_sqrt', 'mcr_tp070_sqrt', 'mm_tp070_sqrt',
       'ks_tp070_sqrt', 'total_ms_tp080_sqrt', 'mw_tp080_sqrt',
       'mwc_tp080_sqrt', 'mcl_tp080_sqrt', 'mcr_tp080_sqrt', 'mm_tp080_sqrt',
       'ks_tp080_sqrt', 'total_ms_tp090_sqrt', 'mw_tp090_sqrt',
       'mwc_tp090_sqrt', 'mcl_tp090_sqrt', 'mcr_tp090_sqrt', 'mm_tp090_sqrt',
       'ks_tp090_sqrt', 'total_ms_tp100_sqrt', 'mw_tp100_sqrt',
       'mwc_tp100_sqrt', 'mcl_tp100_sqrt', 'mcr_tp100_sqrt', 'mm_tp100_sqrt',
       'ks_tp100_sqrt', 'final_events', 'final_trials'],
      dtype='object')
In [40]:
totl_vars = get_var_list_b(final_sqrt_df,['total'])
mw_vars = get_var_list_b(final_sqrt_df,['mw_'])
mwc_vars = get_var_list_b(final_sqrt_df,['mwc'])
mcl_vars = get_var_list_b(final_sqrt_df,['mcl'])
mcr_vars = get_var_list_b(final_sqrt_df,['mcr'])
mm_vars = get_var_list_b(final_sqrt_df,['mm'])
# Keystroke columns are prefixed 'ks'; the original pattern 'mws' matches no column name.
ks_vars = get_var_list_b(final_sqrt_df,['ks'])

Box plot shows the inputs have different scales

In [41]:
sns.catplot(data = final_sqrt_df, kind='box', aspect=3.5)

plt.show()

Correlation plot shows the inputs are highly correlated

In [42]:
fig, ax = plt.subplots(figsize=(12, 8))

sns.heatmap(data = final_sqrt_df[sqrt_feature_names].corr(), 
            vmin=-1, vmax=1, center = 0,
            cmap='coolwarm', 
            ax=ax)

plt.show()
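
A heatmap of 77 features is hard to read pair by pair. As a complement, the strongest correlations can be listed numerically. The sketch below does this on a small synthetic frame; the column names a, b, c and the data are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.05, size=200),  # nearly a copy of a
    "c": rng.normal(size=200),                     # independent noise
})

corr = toy.corr()
# Keep each pair once by masking the lower triangle and the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Long format, drop the masked cells, and sort by absolute correlation.
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Applied to the notebook's data, the same steps would start from final_sqrt_df[sqrt_feature_names].corr() instead of the toy frame.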

Standardize the numeric inputs

In [43]:
Xtimepoints = StandardScaler().fit_transform( sqrt_features_df )
In [44]:
Xtimepoints.shape
Out[44]:
(2444, 77)
In [45]:
sns.catplot(data = pd.DataFrame(Xtimepoints, columns=sqrt_feature_names), kind='box', aspect=3.5)

plt.show()
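
The introduction mentions principal components as an alternative set of numeric features, and PCA is imported at the top of the notebook, but the PCA step itself is not shown in this part. A minimal sketch of how standardized inputs could be reduced with PCA follows. The synthetic data and the 95% explained-variance threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
# Six correlated columns on very different scales, built from two latent factors.
scales = np.array([1, 10, 100, 1, 10, 100])
X = (latent @ mixing + 0.1 * rng.normal(size=(500, 6))) * scales

# Standardize first so that no column dominates because of its scale.
X_std = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain more than 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
scores = pca.fit_transform(X_std)
print(scores.shape, pca.explained_variance_ratio_.round(3))
```

Because the component scores are uncorrelated with one another, using them in place of the original inputs sidesteps the strong correlations seen in the heatmap.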

Poisson Regression with standardized features

In [46]:
final_sqrt_std_df = pd.concat([final_sqrt_df.loc[:,['sess','sid','actv_grp','final_events','final_trials']].copy(), pd.DataFrame(Xtimepoints, columns=sqrt_feature_names).copy()], axis=1)
In [47]:
final_sqrt_std_df.head()
Out[47]:
   sess  sid  actv_grp  final_events  final_trials  total_ms_tp000_sqrt  mw_tp000_sqrt  mwc_tp000_sqrt  mcl_tp000_sqrt  mcr_tp000_sqrt  ...  mcr_tp090_sqrt  mm_tp090_sqrt  ks_tp090_sqrt  total_ms_tp100_sqrt  mw_tp100_sqrt  mwc_tp100_sqrt  mcl_tp100_sqrt  mcr_tp100_sqrt  mm_tp100_sqrt  ks_tp100_sqrt
0     1    1   Aulaweb           2.0           2.0            -0.790076      -0.521312       -0.182566       -0.764020       -0.612717  ...        1.005451       0.955513      -0.736663             0.045447      -0.222258       -0.340979        0.889260        1.185153       1.194975      -0.717268
1     1    1     Blank           2.0           2.0            -0.790076      -0.521312       -0.182566       -0.764020       -0.612717  ...        1.027941       0.992772      -0.733079             0.042272      -0.224584       -0.340979        0.885189        1.185153       1.191882      -0.717268
2     1    1     Deeds           2.0           2.0            -0.594451      -0.214631       -0.182566       -0.638176       -0.612717  ...        1.007958       0.969835      -0.737460            -0.099400      -0.248136       -0.340979        0.763366        1.129168       1.095563      -0.783181
3     1    1   Diagram           2.0           2.0             0.680382       0.175782       -0.182566        0.907550        1.709392  ...        0.997917       0.924780      -0.755483             0.010015      -0.248136       -0.340979        0.845303        1.174033       1.161415      -0.783181
4     1    1     Other           2.0           2.0            -0.890136      -0.521312       -0.182566       -0.935924       -0.612717  ...        1.027941       1.034190      -0.721576             0.042272      -0.224584       -0.340979        0.884171        1.185153       1.191564      -0.717268

5 rows × 82 columns

In [48]:
# Join the feature names into the right-hand side of a formula: 'x1 + x2 + ...'
num_features_str = ' + '.join(sqrt_feature_names)
num_features_str
Out[48]:
'total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt'
In [49]:
descriptive_formulas = ['final_events ~ sid'
                        ,'final_events ~ sid + actv_grp'
                        ,'final_events ~ sid + actv_grp + ' + num_features_str
                        ,'final_events ~ sid * (' + num_features_str + ')'
                        ,'final_events ~ sid * (actv_grp + ' + num_features_str + ')'
                       ]
In [50]:
predictive_formulas = ['final_events ~ ' + num_features_str
                    ,'final_events ~ (' + num_features_str + ')**2'
                    ,'final_events ~ actv_grp + ' + num_features_str
                    ,'final_events ~ actv_grp * (' + num_features_str + ')'
                    ,'final_events ~ actv_grp + (' + num_features_str + ')**2'
                    ,'final_events ~ actv_grp * (' + num_features_str + ')**2'
                   ]
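
The formulas above rely on the patsy operators `+`, `*`, and `**2`. To make their expansion concrete, the sketch below builds design matrices for a tiny invented frame and lists the resulting columns. The frame and the helper design_columns are hypothetical and exist only for this illustration.

```python
import pandas as pd
from patsy import dmatrix

toy = pd.DataFrame({
    "actv_grp": ["Deeds", "Study", "Deeds", "Study"],
    "x1": [0.1, 0.4, -0.3, 1.2],
    "x2": [1.0, -0.5, 0.2, 0.7],
})


def design_columns(rhs):
    """Column names patsy builds for a right-hand-side formula (illustrative helper)."""
    return dmatrix(rhs, toy).design_info.column_names


# '+' adds main effects only.
additive = design_columns("actv_grp + x1 + x2")
# 'a * (b + c)' adds main effects plus the a:b and a:c interactions.
crossed = design_columns("actv_grp * (x1 + x2)")
# '(b + c)**2' adds main effects plus all pairwise interactions among b and c.
pairwise = design_columns("(x1 + x2)**2")

for name, cols in [("additive", additive), ("crossed", crossed), ("pairwise", pairwise)]:
    print(name, cols)
```

With the 77 numeric features used here, `(...)**2` produces 77 main effects plus 77 × 76 / 2 = 2926 pairwise interactions, which is 3003 columns before the intercept and more than the 2444 rows available. Those formulas are therefore heavily over-parameterized for this data set.
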
In [51]:
test_formula_list = descriptive_formulas + predictive_formulas
In [52]:
test_formula_list
Out[52]:
['final_events ~ sid',
 'final_events ~ sid + actv_grp',
 'final_events ~ sid + actv_grp + total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt',
 'final_events ~ sid * (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)',
 'final_events ~ sid * (actv_grp + total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)',
 'final_events ~ total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt',
 'final_events ~ (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)**2',
 'final_events ~ actv_grp + total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt',
 'final_events ~ actv_grp * (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)',
 'final_events ~ actv_grp + (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)**2',
 'final_events ~ actv_grp * (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)**2']
Evaluate number of features with dmatrices¶
In [53]:
sk_list = make_dmat(final_sqrt_std_df, test_formula_list)
In [54]:
model_dim = make_dim_df(final_sqrt_std_df, sk_list, test_formula_list)
In [55]:
model_dim
Out[55]:
    model name  dimensions  number of obs dim < obs
0            0          62           2444       Yes
1            1          72           2444       Yes
2            2         149           2444       Yes
3            3        4836           2444        No
4            4        5456           2444        No
5            5          78           2444       Yes
6            6        3004           2444        No
7            7          88           2444       Yes
8            8         858           2444       Yes
9            9        3014           2444        No
10          10       33044           2444        No
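
The dimensions reported by `model_dim` can be cross-checked by hand, because treatment coding gives each categorical factor one column per level minus the reference level. The sketch below is only an arithmetic check, not the `make_dim_df` helper. It assumes 62 `sid` levels and 11 `actv_grp` levels (inferred from the 61 and 10 dummy coefficients in the Poisson summary printed later in this notebook) and 77 numeric features. The mapping of models 0 to 2 onto `sid`, `sid + actv_grp`, and `sid + actv_grp + numeric features` is also an assumption, since those formulas are not visible in this part of the output.

```python
from math import comb

# Assumed level counts (inferred, not stated directly in the notebook output)
N_SID = 62     # 61 sid dummies + reference level
N_ACTV = 11    # 10 actv_grp dummies + reference level
N_NUM = 77     # 7 mouse/keyboard measures x 11 time points
N_OBS = 2444

sid = N_SID - 1                   # dummy columns for sid
actv = N_ACTV - 1                 # dummy columns for actv_grp
num = N_NUM                       # main-effect columns for the numeric features
num_sq = N_NUM + comb(N_NUM, 2)   # (a + b + ...)**2: main effects + pairwise interactions

# Column counts excluding the intercept; "x * (y)" expands to x + y + x:y.
terms = {
    0: sid,
    1: sid + actv,
    2: sid + actv + num,
    3: sid + num + sid * num,
    4: sid + (actv + num) + sid * (actv + num),
    5: num,
    6: num_sq,
    7: actv + num,
    8: actv + num + actv * num,
    9: actv + num_sq,
    10: actv + num_sq + actv * num_sq,
}

dimensions = {k: 1 + v for k, v in terms.items()}   # add the intercept column
for k, d in dimensions.items():
    print(k, d, "Yes" if d < N_OBS else "No")
```

Under these assumptions the counts agree with the table, and the four `sid` or `**2` interaction models exceed the 2444 observations, which is why they are not fit below.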
In [56]:
adjust_desc_formulas = ['final_events ~ sid'
                        ,'final_events ~ sid + actv_grp'
                        ,'final_events ~ sid + actv_grp + ' + num_features_str
                       ]
In [57]:
adjust_pred_formulas = ['final_events ~ ' + num_features_str
                        ,'final_events ~ actv_grp + ' + num_features_str
                        #,'final_events ~ actv_grp * (' + num_features_str + ')'
                       ]
In [58]:
formula_list = adjust_desc_formulas + adjust_pred_formulas
In [59]:
model_list = []

# Fit a Poisson regression for each formula with the Newton conjugate-gradient solver.
# Alternative fit methods: 'bfgs', 'lbfgs', 'powell', 'newton', 'cg', 'basinhopping'.
for a_formula in formula_list:
    model_list.append(smf.poisson(formula=a_formula, data=final_sqrt_std_df).fit(method='ncg'))
Optimization terminated successfully.
         Current function value: 1.543124
         Iterations: 14
         Function evaluations: 15
         Gradient evaluations: 15
         Hessian evaluations: 14
Optimization terminated successfully.
         Current function value: 1.525616
         Iterations: 14
         Function evaluations: 15
         Gradient evaluations: 15
         Hessian evaluations: 14
Optimization terminated successfully.
         Current function value: 1.261055
         Iterations: 25
         Function evaluations: 26
         Gradient evaluations: 26
         Hessian evaluations: 25
Optimization terminated successfully.
         Current function value: 1.429776
         Iterations: 14
         Function evaluations: 15
         Gradient evaluations: 15
         Hessian evaluations: 14
Optimization terminated successfully.
         Current function value: 1.422724
         Iterations: 14
         Function evaluations: 15
         Gradient evaluations: 15
         Hessian evaluations: 14
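
The `Current function value` printed for each fit is the negative log-likelihood averaged over observations, so multiplying by the 2444 rows recovers the log-likelihood (for the third model, 1.261055 × 2444 ≈ 3082.0, which matches the Log-Likelihood in its summary). A minimal sketch of the AIC and BIC arithmetic follows. The parameter counts are taken from the design-matrix dimensions and are an assumption for the models whose summaries are not printed; the function name `information_criteria` is illustrative only.

```python
from math import log

N_OBS = 2444

# (mean negative log-likelihood from the optimizer output, assumed number of parameters)
fits = {
    "mod00": (1.543124, 62),    # sid
    "mod01": (1.525616, 72),    # sid + actv_grp
    "mod02": (1.261055, 149),   # sid + actv_grp + numeric features
    "mod03": (1.429776, 78),    # numeric features
    "mod04": (1.422724, 88),    # actv_grp + numeric features
}

def information_criteria(mean_nll, k, n=N_OBS):
    """Return (log-likelihood, AIC, BIC) from the averaged negative log-likelihood."""
    llf = -mean_nll * n
    aic = 2 * k - 2 * llf
    bic = k * log(n) - 2 * llf
    return llf, aic, bic

results = {name: information_criteria(*vals) for name, vals in fits.items()}
for name, (llf, aic, bic) in results.items():
    print(f"{name}: llf={llf:.1f}  AIC={aic:.1f}  BIC={bic:.1f}")

# Model with the smallest AIC
best_by_aic = min(results, key=lambda name: results[name][1])
```

With these inputs the additive `sid + actv_grp + numeric features` model has the lowest AIC, consistent with the conclusion stated at the top of the notebook.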
In [60]:
model_results = pd.DataFrame({'model_name': ['mod00','mod01','mod02','mod03','mod04'],
                              'AIC': [mod.aic for mod in model_list],
                              'BIC': [mod.bic for mod in model_list],
                              'Prsquared': [mod.prsquared for mod in model_list]})
In [61]:
sns.relplot(data = model_results.melt(id_vars=['model_name']),
            x='model_name',
            y='value', 
            col='variable',
            col_wrap=2,
            facet_kws = {'sharey': False})

plt.show()
In [62]:
print(model_list[2].summary())
                          Poisson Regression Results                          
==============================================================================
Dep. Variable:           final_events   No. Observations:                 2444
Model:                        Poisson   Df Residuals:                     2295
Method:                           MLE   Df Model:                          148
Date:                Thu, 27 Apr 2023   Pseudo R-squ.:                  0.2654
Time:                        07:52:47   Log-Likelihood:                -3082.0
converged:                       True   LL-Null:                       -4195.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
===============================================================================================
                                  coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       0.7877      0.132      5.972      0.000       0.529       1.046
sid[T.2]                       -0.7525      0.181     -4.147      0.000      -1.108      -0.397
sid[T.4]                       -1.4534      0.216     -6.716      0.000      -1.878      -1.029
sid[T.5]                       -0.4380      0.172     -2.552      0.011      -0.774      -0.102
sid[T.7]                       -0.4922      0.166     -2.957      0.003      -0.819      -0.166
sid[T.8]                       -1.4308      0.243     -5.885      0.000      -1.907      -0.954
sid[T.9]                       -1.0395      0.194     -5.363      0.000      -1.419      -0.660
sid[T.11]                      -0.0999      0.167     -0.599      0.549      -0.427       0.227
sid[T.12]                      -0.7656      0.196     -3.915      0.000      -1.149      -0.382
sid[T.14]                      -0.2179      0.161     -1.352      0.176      -0.534       0.098
sid[T.19]                      -0.5119      0.187     -2.735      0.006      -0.879      -0.145
sid[T.20]                       0.1848      0.155      1.190      0.234      -0.120       0.489
sid[T.22]                      -1.5446      0.291     -5.304      0.000      -2.115      -0.974
sid[T.24]                      -0.9950      0.181     -5.486      0.000      -1.350      -0.640
sid[T.25]                      -0.5366      0.198     -2.715      0.007      -0.924      -0.149
sid[T.30]                      -0.4144      0.175     -2.362      0.018      -0.758      -0.071
sid[T.33]                     -14.2981    344.152     -0.042      0.967    -688.824     660.228
sid[T.34]                      -0.8755      0.186     -4.704      0.000      -1.240      -0.511
sid[T.37]                      -1.2331      0.259     -4.763      0.000      -1.741      -0.726
sid[T.38]                      -0.8374      0.184     -4.562      0.000      -1.197      -0.478
sid[T.39]                      -0.6275      0.194     -3.242      0.001      -1.007      -0.248
sid[T.42]                      -1.4618      0.215     -6.809      0.000      -1.883      -1.041
sid[T.44]                       0.4767      0.187      2.555      0.011       0.111       0.842
sid[T.45]                       0.0498      0.194      0.257      0.797      -0.330       0.429
sid[T.46]                      -1.3844      0.285     -4.859      0.000      -1.943      -0.826
sid[T.47]                      -1.1223      0.205     -5.471      0.000      -1.524      -0.720
sid[T.49]                      -1.2151      0.193     -6.299      0.000      -1.593      -0.837
sid[T.51]                      -1.4011      0.223     -6.297      0.000      -1.837      -0.965
sid[T.52]                      -1.4096      0.213     -6.626      0.000      -1.827      -0.993
sid[T.54]                      -0.7311      0.183     -3.998      0.000      -1.089      -0.373
sid[T.55]                       0.0382      0.175      0.219      0.827      -0.304       0.381
sid[T.56]                      -0.1387      0.165     -0.839      0.401      -0.463       0.185
sid[T.57]                     -14.6875    344.152     -0.043      0.966    -689.213     659.838
sid[T.58]                      -0.7754      0.439     -1.766      0.077      -1.636       0.085
sid[T.59]                      -1.2824      0.222     -5.780      0.000      -1.717      -0.848
sid[T.60]                     -15.7662    344.152     -0.046      0.963    -690.292     658.760
sid[T.61]                      -0.5793      0.187     -3.096      0.002      -0.946      -0.213
sid[T.62]                      -0.7993      0.283     -2.820      0.005      -1.355      -0.244
sid[T.64]                     -15.7767    344.152     -0.046      0.963    -690.302     658.749
sid[T.67]                      -0.2502      0.188     -1.333      0.182      -0.618       0.118
sid[T.68]                       0.0169      0.158      0.107      0.915      -0.293       0.327
sid[T.69]                      -0.7510      0.223     -3.374      0.001      -1.187      -0.315
sid[T.70]                      -0.6597      0.186     -3.552      0.000      -1.024      -0.296
sid[T.71]                      -0.4225      0.202     -2.090      0.037      -0.819      -0.026
sid[T.73]                      -1.0531      0.208     -5.072      0.000      -1.460      -0.646
sid[T.75]                       0.0371      0.171      0.217      0.828      -0.298       0.372
sid[T.77]                      -0.4970      0.438     -1.135      0.256      -1.355       0.361
sid[T.79]                      -0.7506      0.180     -4.179      0.000      -1.103      -0.399
sid[T.80]                      -0.8105      0.194     -4.182      0.000      -1.190      -0.431
sid[T.82]                      -1.8675      0.222     -8.394      0.000      -2.304      -1.431
sid[T.83]                      -1.6625      0.217     -7.659      0.000      -2.088      -1.237
sid[T.87]                      -0.4249      0.168     -2.532      0.011      -0.754      -0.096
sid[T.91]                      -1.1406      0.189     -6.044      0.000      -1.510      -0.771
sid[T.92]                      -0.6683      0.192     -3.473      0.001      -1.046      -0.291
sid[T.94]                      -0.5126      0.165     -3.098      0.002      -0.837      -0.188
sid[T.95]                      -1.1921      0.198     -6.023      0.000      -1.580      -0.804
sid[T.99]                      -1.2694      0.228     -5.575      0.000      -1.716      -0.823
sid[T.101]                     -1.0588      0.206     -5.150      0.000      -1.462      -0.656
sid[T.102]                     -1.1561      0.209     -5.526      0.000      -1.566      -0.746
sid[T.103]                    -16.1082    344.152     -0.047      0.963    -690.634     658.417
sid[T.104]                     -0.9619      0.285     -3.379      0.001      -1.520      -0.404
sid[T.106]                      0.4203      0.273      1.538      0.124      -0.115       0.956
actv_grp[T.Blank]               0.0799      0.077      1.038      0.299      -0.071       0.231
actv_grp[T.Deeds]               0.0637      0.077      0.826      0.409      -0.087       0.215
actv_grp[T.Diagram]            -0.0284      0.076     -0.373      0.709      -0.178       0.121
actv_grp[T.FSM]                -0.5538      0.240     -2.311      0.021      -1.023      -0.084
actv_grp[T.FSM_Related]        -0.4768      0.186     -2.558      0.011      -0.842      -0.112
actv_grp[T.Other]               0.1131      0.079      1.430      0.153      -0.042       0.268
actv_grp[T.Properties]         -0.0046      0.075     -0.062      0.950      -0.151       0.142
actv_grp[T.Study]               0.0824      0.078      1.052      0.293      -0.071       0.236
actv_grp[T.Study_Materials]    -0.0188      0.181     -0.104      0.917      -0.374       0.336
actv_grp[T.TextEditor]          0.0440      0.077      0.571      0.568      -0.107       0.195
total_ms_tp000_sqrt             0.0630      0.079      0.794      0.427      -0.092       0.218
mw_tp000_sqrt                   0.0172      0.034      0.503      0.615      -0.050       0.084
mwc_tp000_sqrt                 -0.0254      0.028     -0.903      0.367      -0.081       0.030
mcl_tp000_sqrt                 -0.0184      0.117     -0.158      0.875      -0.247       0.210
mcr_tp000_sqrt                 -0.0937      0.037     -2.557      0.011      -0.165      -0.022
mm_tp000_sqrt                   0.1445      0.095      1.529      0.126      -0.041       0.330
ks_tp000_sqrt                  -0.0766      0.045     -1.707      0.088      -0.165       0.011
total_ms_tp010_sqrt             0.1653      0.098      1.678      0.093      -0.028       0.358
mw_tp010_sqrt                   0.0178      0.058      0.309      0.757      -0.095       0.131
mwc_tp010_sqrt                  0.0184      0.047      0.391      0.696      -0.074       0.111
mcl_tp010_sqrt                  0.0964      0.145      0.666      0.506      -0.188       0.380
mcr_tp010_sqrt                 -0.0153      0.047     -0.329      0.742      -0.107       0.076
mm_tp010_sqrt                  -0.3221      0.155     -2.083      0.037      -0.625      -0.019
ks_tp010_sqrt                   0.0438      0.061      0.712      0.477      -0.077       0.164
total_ms_tp020_sqrt             0.0586      0.122      0.481      0.630      -0.180       0.297
mw_tp020_sqrt                  -0.1429      0.090     -1.595      0.111      -0.319       0.033
mwc_tp020_sqrt                  0.0407      0.081      0.504      0.614      -0.117       0.199
mcl_tp020_sqrt                 -0.3378      0.200     -1.692      0.091      -0.729       0.053
mcr_tp020_sqrt                  0.0323      0.065      0.501      0.617      -0.094       0.159
mm_tp020_sqrt                   0.4155      0.240      1.730      0.084      -0.055       0.886
ks_tp020_sqrt                  -0.0630      0.084     -0.754      0.451      -0.227       0.101
total_ms_tp030_sqrt             0.1029      0.156      0.661      0.509      -0.202       0.408
mw_tp030_sqrt                   0.2046      0.112      1.832      0.067      -0.014       0.424
mwc_tp030_sqrt                 -0.0238      0.093     -0.256      0.798      -0.206       0.158
mcl_tp030_sqrt                  0.2395      0.262      0.915      0.360      -0.274       0.752
mcr_tp030_sqrt                  0.3711      0.094      3.968      0.000       0.188       0.554
mm_tp030_sqrt                  -0.5317      0.306     -1.736      0.083      -1.132       0.069
ks_tp030_sqrt                  -0.1677      0.107     -1.567      0.117      -0.377       0.042
total_ms_tp040_sqrt             0.3439      0.190      1.809      0.070      -0.029       0.717
mw_tp040_sqrt                  -0.2189      0.125     -1.749      0.080      -0.464       0.026
mwc_tp040_sqrt                  0.0504      0.094      0.537      0.591      -0.133       0.234
mcl_tp040_sqrt                 -0.2796      0.306     -0.913      0.361      -0.880       0.320
mcr_tp040_sqrt                  0.1229      0.116      1.057      0.291      -0.105       0.351
mm_tp040_sqrt                  -0.0419      0.375     -0.112      0.911      -0.776       0.692
ks_tp040_sqrt                  -0.1506      0.127     -1.184      0.236      -0.400       0.099
total_ms_tp050_sqrt             0.2209      0.211      1.045      0.296      -0.193       0.635
mw_tp050_sqrt                   0.6338      0.190      3.329      0.001       0.261       1.007
mwc_tp050_sqrt                 -0.1184      0.135     -0.879      0.379      -0.382       0.146
mcl_tp050_sqrt                 -0.1198      0.347     -0.346      0.730      -0.800       0.560
mcr_tp050_sqrt                  0.0621      0.129      0.482      0.630      -0.190       0.314
mm_tp050_sqrt                  -0.3126      0.426     -0.734      0.463      -1.147       0.522
ks_tp050_sqrt                   0.0126      0.133      0.095      0.924      -0.248       0.273
total_ms_tp060_sqrt            -0.2903      0.239     -1.213      0.225      -0.759       0.179
mw_tp060_sqrt                  -0.4712      0.234     -2.011      0.044      -0.930      -0.012
mwc_tp060_sqrt                  0.1481      0.217      0.681      0.496      -0.278       0.574
mcl_tp060_sqrt                 -0.5506      0.389     -1.414      0.157      -1.314       0.213
mcr_tp060_sqrt                 -0.0315      0.160     -0.196      0.844      -0.346       0.283
mm_tp060_sqrt                   0.9775      0.492      1.987      0.047       0.013       1.942
ks_tp060_sqrt                   0.2170      0.156      1.390      0.165      -0.089       0.523
total_ms_tp070_sqrt             0.2821      0.255      1.107      0.268      -0.217       0.782
mw_tp070_sqrt                   0.2210      0.232      0.954      0.340      -0.233       0.675
mwc_tp070_sqrt                  0.1204      0.258      0.467      0.640      -0.385       0.625
mcl_tp070_sqrt                  0.7183      0.420      1.709      0.087      -0.105       1.542
mcr_tp070_sqrt                 -0.5062      0.178     -2.852      0.004      -0.854      -0.158
mm_tp070_sqrt                  -0.9527      0.541     -1.760      0.078      -2.014       0.108
ks_tp070_sqrt                  -0.2045      0.181     -1.129      0.259      -0.559       0.151
total_ms_tp080_sqrt             0.3415      0.267      1.278      0.201      -0.182       0.865
mw_tp080_sqrt                   0.2295      0.292      0.786      0.432      -0.343       0.802
mwc_tp080_sqrt                 -0.0725      0.341     -0.213      0.832      -0.740       0.595
mcl_tp080_sqrt                 -0.3767      0.484     -0.778      0.436      -1.325       0.572
mcr_tp080_sqrt                 -0.1065      0.214     -0.498      0.619      -0.526       0.313
mm_tp080_sqrt                   0.0419      0.635      0.066      0.947      -1.203       1.286
ks_tp080_sqrt                   0.1794      0.201      0.893      0.372      -0.214       0.573
total_ms_tp090_sqrt            -0.1998      0.279     -0.715      0.475      -0.747       0.348
mw_tp090_sqrt                  -0.3483      0.324     -1.076      0.282      -0.983       0.286
mwc_tp090_sqrt                 -0.2544      0.330     -0.771      0.441      -0.901       0.392
mcl_tp090_sqrt                  0.5542      0.493      1.123      0.261      -0.413       1.521
mcr_tp090_sqrt                 -0.1256      0.252     -0.498      0.618      -0.619       0.368
mm_tp090_sqrt                  -0.2737      0.669     -0.409      0.683      -1.586       1.038
ks_tp090_sqrt                   0.2912      0.206      1.411      0.158      -0.113       0.696
total_ms_tp100_sqrt            -0.6287      0.179     -3.517      0.000      -0.979      -0.278
mw_tp100_sqrt                   0.0136      0.208      0.065      0.948      -0.395       0.422
mwc_tp100_sqrt                  0.1678      0.225      0.746      0.456      -0.273       0.609
mcl_tp100_sqrt                 -0.0073      0.335     -0.022      0.983      -0.665       0.650
mcr_tp100_sqrt                  0.5504      0.183      3.011      0.003       0.192       0.909
mm_tp100_sqrt                   0.5782      0.461      1.254      0.210      -0.326       1.482
ks_tp100_sqrt                   0.0676      0.128      0.526      0.599      -0.184       0.319
===============================================================================================
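
The Pseudo R-squ. in the summary header can be reproduced from the two log-likelihoods it reports. The short sketch below recomputes McFadden's pseudo R-squared and the likelihood-ratio statistic from the rounded values printed above; the variable names are illustrative.

```python
# Values copied from the summary header (rounded as printed)
llf = -3082.0       # Log-Likelihood of the fitted model
llnull = -4195.6    # LL-Null: log-likelihood of the intercept-only model
df_model = 148      # number of slope parameters

# McFadden's pseudo R-squared: 1 - llf / llnull
pseudo_r2 = 1 - llf / llnull

# Likelihood-ratio statistic, compared against a chi-square with df_model degrees of freedom
llr = 2 * (llf - llnull)

print(f"pseudo R2 = {pseudo_r2:.4f}, LLR = {llr:.1f} on {df_model} df")
```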
In [63]:
my_coefplot(model_list[2])
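
Because Poisson regression uses a log link, each coefficient is easier to read after exponentiation: `exp(coef)` is the multiplicative change in the expected number of correct answers. The sketch below converts a few rows of the summary into rate ratios with Wald intervals. The helper `rate_ratio` is hypothetical, and the numeric-feature interpretation assumes the `_std` in `final_sqrt_std_df` means the features were standardized.

```python
from math import exp

def rate_ratio(coef, std_err, z=1.96):
    """Exponentiate a log-link coefficient and its Wald interval."""
    return exp(coef), exp(coef - z * std_err), exp(coef + z * std_err)

# (coef, std err) copied from the Poisson summary
examples = {
    "sid[T.2]": (-0.7525, 0.181),
    "sid[T.33]": (-14.2981, 344.152),
    "mcr_tp030_sqrt": (0.3711, 0.094),
}

for name, (coef, se) in examples.items():
    rr, lower, upper = rate_ratio(coef, se)
    print(f"{name}: rate ratio={rr:.3g}, 95% CI=({lower:.3g}, {upper:.3g})")
```

The `sid` coefficients near -14 to -16 with a standard error of 344.152 translate to a rate ratio that is effectively zero with an uninformative interval. That pattern would be consistent with students who have no correct final answers in any row, although this is not verified here.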

Run PCA¶

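The helper notebook executed below with %run standardizes the variables in features_df and performs PCA; the resulting scores are then available as pc_scores_df.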

In [64]:
features_df = sqrt_features_df.copy()
In [65]:
feature_names = sqrt_features_df.columns
In [66]:
%run CMPINF2120_EPM_PCA_INCL_Over_Lisa.ipynb
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444 entries, 0 to 2443
Data columns (total 77 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PC01    2444 non-null   float64
 1   PC02    2444 non-null   float64
 2   PC03    2444 non-null   float64
 3   PC04    2444 non-null   float64
 4   PC05    2444 non-null   float64
 5   PC06    2444 non-null   float64
 6   PC07    2444 non-null   float64
 7   PC08    2444 non-null   float64
 8   PC09    2444 non-null   float64
 9   PC10    2444 non-null   float64
 10  PC11    2444 non-null   float64
 11  PC12    2444 non-null   float64
 12  PC13    2444 non-null   float64
 13  PC14    2444 non-null   float64
 14  PC15    2444 non-null   float64
 15  PC16    2444 non-null   float64
 16  PC17    2444 non-null   float64
 17  PC18    2444 non-null   float64
 18  PC19    2444 non-null   float64
 19  PC20    2444 non-null   float64
 20  PC21    2444 non-null   float64
 21  PC22    2444 non-null   float64
 22  PC23    2444 non-null   float64
 23  PC24    2444 non-null   float64
 24  PC25    2444 non-null   float64
 25  PC26    2444 non-null   float64
 26  PC27    2444 non-null   float64
 27  PC28    2444 non-null   float64
 28  PC29    2444 non-null   float64
 29  PC30    2444 non-null   float64
 30  PC31    2444 non-null   float64
 31  PC32    2444 non-null   float64
 32  PC33    2444 non-null   float64
 33  PC34    2444 non-null   float64
 34  PC35    2444 non-null   float64
 35  PC36    2444 non-null   float64
 36  PC37    2444 non-null   float64
 37  PC38    2444 non-null   float64
 38  PC39    2444 non-null   float64
 39  PC40    2444 non-null   float64
 40  PC41    2444 non-null   float64
 41  PC42    2444 non-null   float64
 42  PC43    2444 non-null   float64
 43  PC44    2444 non-null   float64
 44  PC45    2444 non-null   float64
 45  PC46    2444 non-null   float64
 46  PC47    2444 non-null   float64
 47  PC48    2444 non-null   float64
 48  PC49    2444 non-null   float64
 49  PC50    2444 non-null   float64
 50  PC51    2444 non-null   float64
 51  PC52    2444 non-null   float64
 52  PC53    2444 non-null   float64
 53  PC54    2444 non-null   float64
 54  PC55    2444 non-null   float64
 55  PC56    2444 non-null   float64
 56  PC57    2444 non-null   float64
 57  PC58    2444 non-null   float64
 58  PC59    2444 non-null   float64
 59  PC60    2444 non-null   float64
 60  PC61    2444 non-null   float64
 61  PC62    2444 non-null   float64
 62  PC63    2444 non-null   float64
 63  PC64    2444 non-null   float64
 64  PC65    2444 non-null   float64
 65  PC66    2444 non-null   float64
 66  PC67    2444 non-null   float64
 67  PC68    2444 non-null   float64
 68  PC69    2444 non-null   float64
 69  PC70    2444 non-null   float64
 70  PC71    2444 non-null   float64
 71  PC72    2444 non-null   float64
 72  PC73    2444 non-null   float64
 73  PC74    2444 non-null   float64
 74  PC75    2444 non-null   float64
 75  PC76    2444 non-null   float64
 76  PC77    2444 non-null   float64
dtypes: float64(77)
memory usage: 1.4 MB
In [67]:
first_pc_scores_df = pc_scores_df.copy()
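
The `%run` notebook is not shown here, so the following is not its code. It is a minimal sketch of the same two steps (standardize, then PCA) using a NumPy SVD as a stand-in for the `StandardScaler` and `PCA` classes imported at the top of this notebook. The function name `standardize_and_pca` and the toy data are assumptions for illustration.

```python
import numpy as np

def standardize_and_pca(X):
    """Z-score each column, then compute principal component scores via SVD."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # population std (ddof=0)
    std[std == 0] = 1.0          # guard against constant columns
    Z = (X - mean) / std

    # Z = U S Vt; the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * S               # equivalent to Z @ Vt.T
    explained_var_ratio = S**2 / np.sum(S**2)
    return scores, explained_var_ratio

# Toy data: 200 rows, 5 columns, with column 1 strongly tied to column 0
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5))
toy[:, 1] = 2 * toy[:, 0] + rng.normal(scale=0.1, size=200)

scores, ratio = standardize_and_pca(toy)
print(scores.shape, np.round(ratio, 3))
```

With 77 standardized input columns this procedure yields 77 score columns, matching the PC01 to PC77 layout above. The signs of individual components may differ from scikit-learn's output, since the sign of each principal axis is arbitrary.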

Create a dataset with key, categorical, and output variables from final_sqrt_df and PCs from first_pc_scores_df¶

In [68]:
first_pc_scores_df.head()
Out[68]:
PC01 PC02 PC03 PC04 PC05 PC06 PC07 PC08 PC09 PC10 ... PC68 PC69 PC70 PC71 PC72 PC73 PC74 PC75 PC76 PC77
0 -1.343494 0.614381 3.048655 -2.980032 -0.717602 -2.483105 -0.946123 -0.109224 0.802887 0.573025 ... 0.021226 0.007764 -0.042438 0.007726 0.016882 -0.033860 -0.020212 -0.020811 0.002340 0.013947
1 2.264423 -0.252348 3.635242 -1.695365 -0.567017 -2.818182 -1.183510 -1.495354 0.493679 1.094633 ... -0.054488 0.016730 0.010136 -0.022828 -0.049764 -0.025509 -0.011965 -0.003058 0.010145 -0.003176
2 2.407197 -0.285384 3.514516 -1.835526 -0.700168 -2.871521 -1.200218 -1.409066 0.499980 0.042987 ... -0.008969 -0.002833 -0.015106 0.000015 -0.020132 -0.009533 -0.014708 0.001780 0.028581 -0.000724
3 1.800267 -0.177009 3.836746 -0.226538 0.360890 -3.378267 -1.494108 1.350565 1.428391 -0.292023 ... -0.052102 0.019233 -0.019455 -0.021534 -0.068026 -0.055084 0.009613 0.001835 -0.002256 -0.030537
4 2.285621 -0.236315 3.690869 -1.931407 -0.768909 -2.861841 -1.110547 -1.684833 0.437358 0.939661 ... -0.031325 -0.003566 0.041341 -0.008955 -0.027152 -0.000594 -0.012622 0.027929 0.006442 -0.031819

5 rows × 77 columns

In [69]:
final_sqrt_df.head()
Out[69]:
sess sid actv_grp total_ms_tp000_sqrt mw_tp000_sqrt mwc_tp000_sqrt mcl_tp000_sqrt mcr_tp000_sqrt mm_tp000_sqrt ks_tp000_sqrt ... ks_tp090_sqrt total_ms_tp100_sqrt mw_tp100_sqrt mwc_tp100_sqrt mcl_tp100_sqrt mcr_tp100_sqrt mm_tp100_sqrt ks_tp100_sqrt final_events final_trials
0 1 1 Aulaweb 89.442719 0.000000 0.0 2.000000 0.000000 21.931712 0.000000 ... 31.572140 2563.981279 21.213203 0.0 65.696271 17.146428 570.063154 34.409301 2.0 2.0
1 1 1 Blank 89.442719 0.000000 0.0 2.000000 0.000000 23.237900 0.000000 ... 31.629101 2562.420730 21.166010 0.0 65.635356 17.146428 569.584937 34.409301 2.0 2.0
2 1 1 Deeds 202.484567 2.449490 0.0 3.464102 0.000000 46.054316 2.000000 ... 31.559468 2492.789602 20.688161 0.0 63.812225 16.852300 554.692708 33.346664 2.0 2.0
3 1 1 Diagram 939.148551 5.567764 0.0 21.447611 7.348469 183.891272 4.582576 ... 31.272992 2546.566316 20.688161 0.0 65.038450 17.088007 564.874322 33.346664 2.0 2.0
4 1 1 Other 31.622777 0.000000 0.0 0.000000 0.000000 9.165151 0.000000 ... 31.811947 2562.420730 21.166010 0.0 65.620119 17.146428 569.535776 34.409301 2.0 2.0

5 rows × 82 columns

In [70]:
final_sqrt_df.loc[:, ['sess','sid','actv_grp','final_events','final_trials']]
Out[70]:
sess sid actv_grp final_events final_trials
0 1 1 Aulaweb 2.0 2.0
1 1 1 Blank 2.0 2.0
2 1 1 Deeds 2.0 2.0
3 1 1 Diagram 2.0 2.0
4 1 1 Other 2.0 2.0
... ... ... ... ... ...
2439 6 102 FSM_Related 0.0 2.0
2440 6 102 Other 0.0 2.0
2441 6 102 Properties 0.0 2.0
2442 6 102 Study 0.0 2.0
2443 6 102 TextEditor 0.0 2.0

2444 rows × 5 columns

In [71]:
pd.concat([final_sqrt_df.loc[:,['sess','sid','actv_grp','final_events','final_trials']].copy(), first_pc_scores_df], axis=1)
Out[71]:
sess sid actv_grp final_events final_trials PC01 PC02 PC03 PC04 PC05 ... PC68 PC69 PC70 PC71 PC72 PC73 PC74 PC75 PC76 PC77
0 1 1 Aulaweb 2.0 2.0 -1.343494 0.614381 3.048655 -2.980032 -0.717602 ... 0.021226 0.007764 -0.042438 0.007726 0.016882 -0.033860 -0.020212 -0.020811 0.002340 0.013947
1 1 1 Blank 2.0 2.0 2.264423 -0.252348 3.635242 -1.695365 -0.567017 ... -0.054488 0.016730 0.010136 -0.022828 -0.049764 -0.025509 -0.011965 -0.003058 0.010145 -0.003176
2 1 1 Deeds 2.0 2.0 2.407197 -0.285384 3.514516 -1.835526 -0.700168 ... -0.008969 -0.002833 -0.015106 0.000015 -0.020132 -0.009533 -0.014708 0.001780 0.028581 -0.000724
3 1 1 Diagram 2.0 2.0 1.800267 -0.177009 3.836746 -0.226538 0.360890 ... -0.052102 0.019233 -0.019455 -0.021534 -0.068026 -0.055084 0.009613 0.001835 -0.002256 -0.030537
4 1 1 Other 2.0 2.0 2.285621 -0.236315 3.690869 -1.931407 -0.768909 ... -0.031325 -0.003566 0.041341 -0.008955 -0.027152 -0.000594 -0.012622 0.027929 0.006442 -0.031819
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2439 6 102 FSM_Related 0.0 2.0 1.539805 -0.209572 5.056273 -0.625754 1.067768 ... -0.051070 0.002365 -0.015540 0.000878 -0.008006 0.012859 -0.005245 -0.003537 -0.004329 0.007108
2440 6 102 Other 0.0 2.0 -3.926028 1.379748 3.355224 -4.322836 0.003769 ... -0.022156 0.008253 -0.027522 -0.007256 0.063841 -0.014692 0.001131 -0.003470 0.003893 -0.011124
2441 6 102 Properties 0.0 2.0 1.810717 -0.137528 5.400015 -0.807313 1.201469 ... -0.004866 -0.000190 -0.007158 -0.008692 -0.018735 -0.015721 -0.015870 0.008795 -0.032018 0.004263
2442 6 102 Study 0.0 2.0 0.895392 0.111307 4.748045 -2.764785 -0.059818 ... 0.002408 -0.067429 0.027239 0.034184 0.016383 -0.008254 0.006501 -0.013523 -0.007994 0.010186
2443 6 102 TextEditor 0.0 2.0 -0.263473 0.345440 4.523714 -1.698905 0.928430 ... 0.061311 -0.020104 -0.030066 0.015619 0.085143 -0.015343 -0.024832 0.008714 0.012625 0.005287

2444 rows × 82 columns

In [72]:
pc_df_to_model = pd.concat([final_sqrt_df.loc[:,['sess','sid','actv_grp','final_events','final_trials']].copy(), pc_scores_df.copy()], axis=1)
In [73]:
pc_features = ['PC01','PC02','PC03','PC04','PC05','PC06','PC07','PC08']
In [74]:
# Join the PC names into the right-hand side of a formula
pc_features_str = ' + '.join(pc_features)
pc_features_str
Out[74]:
'PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08'
In [75]:
pc_y, pc_X = dmatrices('final_events ~ (' + pc_features_str + ')**2', data=pc_df_to_model, return_type='dataframe')
In [76]:
pc_X.head()
Out[76]:
Intercept PC01 PC02 PC03 PC04 PC05 PC06 PC07 PC08 PC01:PC02 ... PC04:PC05 PC04:PC06 PC04:PC07 PC04:PC08 PC05:PC06 PC05:PC07 PC05:PC08 PC06:PC07 PC06:PC08 PC07:PC08
0 1.0 -1.343494 0.614381 3.048655 -2.980032 -0.717602 -2.483105 -0.946123 -0.109224 -0.825417 ... 2.138478 7.399732 2.819477 0.325492 1.781882 0.678940 0.078380 2.349322 0.271215 0.103340
1 1.0 2.264423 -0.252348 3.635242 -1.695365 -0.567017 -2.818182 -1.183510 -1.495354 -0.571422 ... 0.961300 4.777845 2.006481 2.535169 1.597957 0.671070 0.847891 3.335346 4.214178 1.769766
2 1.0 2.407197 -0.285384 3.514516 -1.835526 -0.700168 -2.871521 -1.200218 -1.409066 -0.686975 ... 1.285177 5.270754 2.203032 2.586378 2.010548 0.840355 0.986583 3.446452 4.046164 1.691187
3 1.0 1.800267 -0.177009 3.836746 -0.226538 0.360890 -3.378267 -1.494108 1.350565 -0.318664 ... -0.081755 0.765305 0.338472 -0.305954 -1.219183 -0.539209 0.487405 5.047496 -4.562567 -2.017889
4 1.0 2.285621 -0.236315 3.690869 -1.931407 -0.768909 -2.861841 -1.110547 -1.684833 -0.540126 ... 1.485077 5.527380 2.144918 3.254098 2.200496 0.853909 1.295483 3.178208 4.821723 1.871085

5 rows × 37 columns

In [77]:
pc_X.columns
Out[77]:
Index(['Intercept', 'PC01', 'PC02', 'PC03', 'PC04', 'PC05', 'PC06', 'PC07',
       'PC08', 'PC01:PC02', 'PC01:PC03', 'PC01:PC04', 'PC01:PC05', 'PC01:PC06',
       'PC01:PC07', 'PC01:PC08', 'PC02:PC03', 'PC02:PC04', 'PC02:PC05',
       'PC02:PC06', 'PC02:PC07', 'PC02:PC08', 'PC03:PC04', 'PC03:PC05',
       'PC03:PC06', 'PC03:PC07', 'PC03:PC08', 'PC04:PC05', 'PC04:PC06',
       'PC04:PC07', 'PC04:PC08', 'PC05:PC06', 'PC05:PC07', 'PC05:PC08',
       'PC06:PC07', 'PC06:PC08', 'PC07:PC08'],
      dtype='object')
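The 37 columns follow from how `**2` expands in a formula: one intercept, the 8 main effects, and every pairwise interaction (8 choose 2 = 28). The sketch below rebuilds the expected column names with `itertools.combinations` only; it does not call patsy, and the helper name `expand_pairwise` is made up for illustration.

```python
from itertools import combinations

def expand_pairwise(features):
    """Column names expected from 'y ~ (f1 + f2 + ...)**2' (illustrative helper)."""
    main_effects = list(features)
    interactions = [f"{a}:{b}" for a, b in combinations(features, 2)]
    return ["Intercept"] + main_effects + interactions

pc_features = [f"PC{i:02d}" for i in range(1, 9)]  # PC01 .. PC08
columns = expand_pairwise(pc_features)

print(len(columns))    # intercept + 8 main effects + 28 interactions
print(columns[9:12])   # the first few interaction terms
```

Comparing this list with `pc_X.columns` is a quick check that the formula produced the intended terms.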
In [78]:
fig, ax = plt.subplots(figsize=(12, 8))

sns.heatmap(data = pc_X.drop(columns=['Intercept']).corr(), 
            vmin=-1, vmax=1, center = 0,
            cmap='coolwarm', 
            ax=ax)

plt.show()
In [79]:
features_df = pc_X.drop(columns=['Intercept']).copy()
In [80]:
feature_names = features_df.columns
In [81]:
%run CMPINF2120_EPM_PCA_INCL_Over_Lisa.ipynb
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444 entries, 0 to 2443
Data columns (total 36 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PC01    2444 non-null   float64
 1   PC02    2444 non-null   float64
 2   PC03    2444 non-null   float64
 3   PC04    2444 non-null   float64
 4   PC05    2444 non-null   float64
 5   PC06    2444 non-null   float64
 6   PC07    2444 non-null   float64
 7   PC08    2444 non-null   float64
 8   PC09    2444 non-null   float64
 9   PC10    2444 non-null   float64
 10  PC11    2444 non-null   float64
 11  PC12    2444 non-null   float64
 12  PC13    2444 non-null   float64
 13  PC14    2444 non-null   float64
 14  PC15    2444 non-null   float64
 15  PC16    2444 non-null   float64
 16  PC17    2444 non-null   float64
 17  PC18    2444 non-null   float64
 18  PC19    2444 non-null   float64
 19  PC20    2444 non-null   float64
 20  PC21    2444 non-null   float64
 21  PC22    2444 non-null   float64
 22  PC23    2444 non-null   float64
 23  PC24    2444 non-null   float64
 24  PC25    2444 non-null   float64
 25  PC26    2444 non-null   float64
 26  PC27    2444 non-null   float64
 27  PC28    2444 non-null   float64
 28  PC29    2444 non-null   float64
 29  PC30    2444 non-null   float64
 30  PC31    2444 non-null   float64
 31  PC32    2444 non-null   float64
 32  PC33    2444 non-null   float64
 33  PC34    2444 non-null   float64
 34  PC35    2444 non-null   float64
 35  PC36    2444 non-null   float64
dtypes: float64(36)
memory usage: 687.5 KB
In [82]:
pc_scores_df.shape
Out[82]:
(2444, 36)

Poisson Regression with PCA¶

In [83]:
pc_descr_formulas = ['final_events ~ sid'
                    ,'final_events ~ sid + actv_grp'
                    ,'final_events ~ sid + actv_grp + ' + pc_features_str
                    ,'final_events ~ sid * (' + pc_features_str + ')'
                    ,'final_events ~ sid * (actv_grp + ' + pc_features_str + ')'
                    ]
In [84]:
pc_pred_formulas = ['final_events ~ ' + pc_features_str
                    ,'final_events ~ (' + pc_features_str + ')**2'
                    ,'final_events ~ actv_grp + ' + pc_features_str
                    ,'final_events ~ actv_grp * (' + pc_features_str + ')'
                    ,'final_events ~ actv_grp + (' + pc_features_str + ')**2'
                    ,'final_events ~ actv_grp * (' + pc_features_str + ')**2'
                   ]
In [85]:
pc_formula_test_list = pc_descr_formulas + pc_pred_formulas
In [86]:
pc_formula_test_list
Out[86]:
['final_events ~ sid',
 'final_events ~ sid + actv_grp',
 'final_events ~ sid + actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08',
 'final_events ~ sid * (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)',
 'final_events ~ sid * (actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)',
 'final_events ~ PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08',
 'final_events ~ (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2',
 'final_events ~ actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08',
 'final_events ~ actv_grp * (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)',
 'final_events ~ actv_grp + (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2',
 'final_events ~ actv_grp * (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2']
In [87]:
##### Evaluate the number of features with dmatrices
In [88]:
pc_sk_list = make_dmat(pc_df_to_model, pc_formula_test_list)
In [89]:
pc_model_dim = make_dim_df(pc_df_to_model, pc_sk_list, pc_formula_test_list)
In [90]:
pc_model_dim
Out[90]:
model name dimensions number of obs dim < obs
0 0 62 2444 Yes
1 1 72 2444 Yes
2 2 80 2444 Yes
3 3 558 2444 Yes
4 4 1178 2444 Yes
5 5 9 2444 Yes
6 6 37 2444 Yes
7 7 19 2444 Yes
8 8 99 2444 Yes
9 9 47 2444 Yes
10 10 407 2444 Yes
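The dimensions in the table can be reproduced by hand, which helps show why the `sid` interaction models grow so quickly. The sketch below assumes treatment (dummy) coding, 62 student levels, and 11 activity-group levels; both level counts are inferred from the first two rows of the table rather than read from the data.

```python
from math import comb

N_SID = 62    # inferred: 'final_events ~ sid' has 62 columns including the intercept
N_ACTV = 11   # inferred: adding actv_grp contributes 10 more columns
N_PC = 8      # PC01 .. PC08

sid = N_SID - 1               # dummy columns (one level is the reference)
actv = N_ACTV - 1
pc = N_PC
pc2 = N_PC + comb(N_PC, 2)    # main effects plus pairwise interactions

expected_dims = [
    1 + sid,                                      # sid
    1 + sid + actv,                               # sid + actv_grp
    1 + sid + actv + pc,                          # sid + actv_grp + PCs
    1 + sid + pc + sid * pc,                      # sid * (PCs)
    1 + sid + (actv + pc) + sid * (actv + pc),    # sid * (actv_grp + PCs)
    1 + pc,                                       # PCs
    1 + pc2,                                      # (PCs)**2
    1 + actv + pc,                                # actv_grp + PCs
    1 + actv + pc + actv * pc,                    # actv_grp * (PCs)
    1 + actv + pc2,                               # actv_grp + (PCs)**2
    1 + actv + pc2 + actv * pc2,                  # actv_grp * (PCs)**2
]
print(expected_dims)
```

A column count below the number of observations does not by itself guarantee a full-rank design matrix. For example, an interaction column is all zeros when a student never used a given activity group, which is one possible reason a model with fewer columns than rows can still fail to fit.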
Errors occur even though the number of features is less than the number of observations¶
In [91]:
pc_adjust_desc_formulas = ['final_events ~ sid'
                    ,'final_events ~ sid + actv_grp'
                    ,'final_events ~ sid + actv_grp + ' + pc_features_str
                    #,'final_events ~ sid * (' + pc_features_str + ')'
                    #,'final_events ~ sid * (actv_grp + ' + pc_features_str + ')'
                    ]
In [92]:
pc_adjust_pred_formulas = ['final_events ~ ' + pc_features_str
                    ,'final_events ~ (' + pc_features_str + ')**2'
                    ,'final_events ~ actv_grp + ' + pc_features_str
                    ,'final_events ~ actv_grp * (' + pc_features_str + ')'
                    ,'final_events ~ actv_grp + (' + pc_features_str + ')**2'
                    #,'final_events ~ actv_grp * (' + pc_features_str + ')**2'
                   ]
In [93]:
pc_formula_list = pc_adjust_desc_formulas + pc_adjust_pred_formulas
In [94]:
pc_formula_list
Out[94]:
['final_events ~ sid',
 'final_events ~ sid + actv_grp',
 'final_events ~ sid + actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08',
 'final_events ~ PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08',
 'final_events ~ (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2',
 'final_events ~ actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08',
 'final_events ~ actv_grp * (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)',
 'final_events ~ actv_grp + (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2']
In [95]:
pc_model_list = []

for a_formula in pc_formula_list:
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='bfgs') )
    pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='ncg') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='lbfgs') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='powell') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='newton') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='cg') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='basinhopping') )
Optimization terminated successfully.
         Current function value: 1.543124
         Iterations: 14
         Function evaluations: 15
         Gradient evaluations: 15
         Hessian evaluations: 14
Optimization terminated successfully.
         Current function value: 1.525616
         Iterations: 14
         Function evaluations: 15
         Gradient evaluations: 15
         Hessian evaluations: 14
Optimization terminated successfully.
         Current function value: 1.325239
         Iterations: 22
         Function evaluations: 24
         Gradient evaluations: 24
         Hessian evaluations: 22
Optimization terminated successfully.
         Current function value: 1.548077
         Iterations: 8
         Function evaluations: 9
         Gradient evaluations: 9
         Hessian evaluations: 8
Optimization terminated successfully.
         Current function value: 1.505923
         Iterations: 12
         Function evaluations: 13
         Gradient evaluations: 13
         Hessian evaluations: 12
Optimization terminated successfully.
         Current function value: 1.536368
         Iterations: 12
         Function evaluations: 13
         Gradient evaluations: 13
         Hessian evaluations: 12
Optimization terminated successfully.
         Current function value: 1.501270
         Iterations: 12
         Function evaluations: 13
         Gradient evaluations: 13
         Hessian evaluations: 12
Optimization terminated successfully.
         Current function value: 1.497400
         Iterations: 15
         Function evaluations: 16
         Gradient evaluations: 16
         Hessian evaluations: 15
In [96]:
pc_model_results = pd.DataFrame({'model_name': ['pc_mod00','pc_mod01','pc_mod02','pc_mod03','pc_mod04','pc_mod05','pc_mod06','pc_mod07'],
                              'AIC': [mod.aic for mod in pc_model_list],
                              'BIC': [mod.bic for mod in pc_model_list],
                              'Prsquared': [mod.prsquared for mod in pc_model_list]})
In [97]:
pc_model_results
Out[97]:
model_name AIC BIC Prsquared
0 pc_mod00 7666.789117 8026.475379 0.101106
1 pc_mod01 7601.210361 8018.910536 0.111305
2 pc_mod02 6637.769417 7101.880723 0.228027
3 pc_mod03 7584.999588 7637.212110 0.098221
4 pc_mod04 7434.950334 7649.601813 0.122776
5 pc_mod05 7547.767911 7657.994346 0.105041
6 pc_mod06 7536.208790 8110.546531 0.125487
7 pc_mod07 7413.290030 7685.955422 0.127741
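The AIC and BIC values for pc_mod02 can be checked by hand from the optimizer output. The sketch below assumes that the printed "Current function value" is the negative log-likelihood divided by the number of observations, and it takes the parameter count (80) from the dimension table above; small differences from the table come from rounding in the printed function value.

```python
import math

n_obs = 2444
k_params = 80                 # columns of model 2, intercept included
mean_neg_loglike = 1.325239   # "Current function value" printed for the third fit

log_likelihood = -mean_neg_loglike * n_obs
aic = 2 * k_params - 2 * log_likelihood
bic = k_params * math.log(n_obs) - 2 * log_likelihood

print(round(log_likelihood, 1))
print(round(aic, 1), round(bic, 1))
```

BIC replaces the per-parameter penalty of 2 with ln(2444), roughly 7.8, so it penalizes the 62-level `sid` factor much more heavily than AIC does.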
In [98]:
sns.relplot(data = pc_model_results.melt(id_vars=['model_name']),
            x='model_name',
            y='value', 
            col='variable',
            col_wrap=2,
            facet_kws = {'sharey': False},
            height=5, aspect=2)

plt.show()
In [99]:
print(pc_model_list[2].summary())
                          Poisson Regression Results                          
==============================================================================
Dep. Variable:           final_events   No. Observations:                 2444
Model:                        Poisson   Df Residuals:                     2364
Method:                           MLE   Df Model:                           79
Date:                Thu, 27 Apr 2023   Pseudo R-squ.:                  0.2280
Time:                        07:52:57   Log-Likelihood:                -3238.9
converged:                       True   LL-Null:                       -4195.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
===============================================================================================
                                  coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       0.8569      0.125      6.841      0.000       0.611       1.102
sid[T.2]                       -0.6452      0.171     -3.766      0.000      -0.981      -0.309
sid[T.4]                       -1.4606      0.199     -7.326      0.000      -1.851      -1.070
sid[T.5]                       -0.1323      0.163     -0.813      0.416      -0.451       0.187
sid[T.7]                       -0.4821      0.157     -3.071      0.002      -0.790      -0.174
sid[T.8]                       -1.5239      0.236     -6.462      0.000      -1.986      -1.062
sid[T.9]                       -0.8776      0.179     -4.896      0.000      -1.229      -0.526
sid[T.11]                      -0.2181      0.161     -1.356      0.175      -0.533       0.097
sid[T.12]                      -0.8716      0.183     -4.765      0.000      -1.230      -0.513
sid[T.14]                      -0.0160      0.154     -0.104      0.917      -0.318       0.286
sid[T.19]                      -0.4075      0.182     -2.245      0.025      -0.763      -0.052
sid[T.20]                       0.3226      0.145      2.221      0.026       0.038       0.607
sid[T.22]                      -1.7517      0.278     -6.308      0.000      -2.296      -1.207
sid[T.24]                      -0.6711      0.170     -3.952      0.000      -1.004      -0.338
sid[T.25]                      -0.7196      0.187     -3.844      0.000      -1.086      -0.353
sid[T.30]                      -0.1450      0.165     -0.880      0.379      -0.468       0.178
sid[T.33]                     -12.0511    119.257     -0.101      0.920    -245.791     221.689
sid[T.34]                      -1.0135      0.178     -5.681      0.000      -1.363      -0.664
sid[T.37]                      -0.9214      0.251     -3.671      0.000      -1.413      -0.430
sid[T.38]                      -0.7803      0.171     -4.565      0.000      -1.115      -0.445
sid[T.39]                      -0.6693      0.170     -3.947      0.000      -1.002      -0.337
sid[T.42]                      -1.4754      0.192     -7.667      0.000      -1.853      -1.098
sid[T.44]                       1.2138      0.161      7.533      0.000       0.898       1.530
sid[T.45]                       0.2702      0.182      1.484      0.138      -0.087       0.627
sid[T.46]                      -1.0982      0.277     -3.962      0.000      -1.641      -0.555
sid[T.47]                      -1.1412      0.195     -5.853      0.000      -1.523      -0.759
sid[T.49]                      -1.1224      0.183     -6.138      0.000      -1.481      -0.764
sid[T.51]                      -1.4247      0.212     -6.708      0.000      -1.841      -1.008
sid[T.52]                      -1.4138      0.205     -6.898      0.000      -1.815      -1.012
sid[T.54]                      -0.9313      0.177     -5.261      0.000      -1.278      -0.584
sid[T.55]                       0.0668      0.170      0.394      0.694      -0.266       0.399
sid[T.56]                      -0.1048      0.154     -0.680      0.496      -0.407       0.197
sid[T.57]                     -12.7017    119.257     -0.107      0.915    -246.442     221.039
sid[T.58]                      -0.6118      0.427     -1.434      0.152      -1.448       0.224
sid[T.59]                      -1.3104      0.212     -6.174      0.000      -1.726      -0.894
sid[T.60]                     -13.4938    119.257     -0.113      0.910    -247.234     220.247
sid[T.61]                      -0.5787      0.178     -3.246      0.001      -0.928      -0.229
sid[T.62]                      -0.6456      0.276     -2.335      0.020      -1.187      -0.104
sid[T.64]                     -13.6254    119.257     -0.114      0.909    -247.366     220.115
sid[T.67]                      -0.1626      0.180     -0.903      0.366      -0.515       0.190
sid[T.68]                       0.0336      0.151      0.222      0.824      -0.263       0.330
sid[T.69]                      -0.7821      0.214     -3.649      0.000      -1.202      -0.362
sid[T.70]                      -0.7380      0.176     -4.188      0.000      -1.083      -0.393
sid[T.71]                      -0.4077      0.197     -2.072      0.038      -0.793      -0.022
sid[T.73]                      -0.8393      0.194     -4.335      0.000      -1.219      -0.460
sid[T.75]                       0.0326      0.161      0.203      0.839      -0.283       0.348
sid[T.77]                      -0.6609      0.426     -1.552      0.121      -1.495       0.174
sid[T.79]                      -0.8551      0.165     -5.192      0.000      -1.178      -0.532
sid[T.80]                      -0.9316      0.180     -5.178      0.000      -1.284      -0.579
sid[T.82]                      -1.6979      0.212     -7.995      0.000      -2.114      -1.282
sid[T.83]                      -1.5917      0.203     -7.853      0.000      -1.989      -1.194
sid[T.87]                      -0.5131      0.160     -3.210      0.001      -0.826      -0.200
sid[T.91]                      -1.1428      0.182     -6.264      0.000      -1.500      -0.785
sid[T.92]                      -0.5063      0.181     -2.804      0.005      -0.860      -0.152
sid[T.94]                      -0.4604      0.155     -2.962      0.003      -0.765      -0.156
sid[T.95]                      -1.0420      0.185     -5.623      0.000      -1.405      -0.679
sid[T.99]                      -1.2638      0.212     -5.957      0.000      -1.680      -0.848
sid[T.101]                     -0.9256      0.197     -4.697      0.000      -1.312      -0.539
sid[T.102]                     -1.2173      0.200     -6.097      0.000      -1.609      -0.826
sid[T.103]                    -13.7455    119.257     -0.115      0.908    -247.486     219.995
sid[T.104]                     -0.8747      0.277     -3.154      0.002      -1.418      -0.331
sid[T.106]                      0.7139      0.210      3.397      0.001       0.302       1.126
actv_grp[T.Blank]               0.0110      0.074      0.149      0.882      -0.134       0.156
actv_grp[T.Deeds]               0.0232      0.072      0.321      0.748      -0.119       0.165
actv_grp[T.Diagram]            -0.1319      0.073     -1.816      0.069      -0.274       0.010
actv_grp[T.FSM]                -0.8886      0.234     -3.805      0.000      -1.346      -0.431
actv_grp[T.FSM_Related]        -0.6528      0.182     -3.578      0.000      -1.011      -0.295
actv_grp[T.Other]               0.0238      0.076      0.313      0.755      -0.125       0.173
actv_grp[T.Properties]         -0.0863      0.073     -1.189      0.235      -0.229       0.056
actv_grp[T.Study]               0.0159      0.075      0.212      0.832      -0.131       0.163
actv_grp[T.Study_Materials]    -0.2142      0.172     -1.245      0.213      -0.551       0.123
actv_grp[T.TextEditor]         -0.0107      0.073     -0.147      0.883      -0.153       0.132
PC01                            0.0564      0.004     15.066      0.000       0.049       0.064
PC02                           -0.0306      0.008     -4.048      0.000      -0.045      -0.016
PC03                           -0.1173      0.008    -15.385      0.000      -0.132      -0.102
PC04                            0.0250      0.008      3.073      0.002       0.009       0.041
PC05                           -0.0447      0.009     -5.207      0.000      -0.062      -0.028
PC06                           -0.1237      0.010    -12.826      0.000      -0.143      -0.105
PC07                            0.0997      0.012      8.191      0.000       0.076       0.124
PC08                           -0.0274      0.015     -1.835      0.066      -0.057       0.002
===============================================================================================
In [100]:
my_coefplot(pc_model_list[2])

The additive model with sid, actv_grp, and the PC features (pc_mod02) is the best model: it has the lowest AIC and BIC and the highest pseudo R-squared¶

Check the validity of the Poisson model¶

In [101]:
pc_df_to_model.final_events.mean()
Out[101]:
1.376022913256956
In [102]:
pc_df_to_model.final_events.var()
Out[102]:
2.3829041926797476

Compare the observed counts to the fitted counts. The .fittedvalues attribute is on the scale of the log link, so apply the inverse link (the exponential) to recover expected counts.¶

In [103]:
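# NOTE: this indexes model_list, not pc_model_list, so the values shown here
# differ from the avg_count column computed from pc_model_list[2] below
model_list[2].fittedvalues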
Out[103]:
0       1.127279
1       0.879788
2       0.913107
3       0.828572
4       1.117086
          ...   
2439   -1.544863
2440   -1.581286
2441   -1.119468
2442   -1.119786
2443   -1.420777
Length: 2444, dtype: float64
In [104]:
np.exp(model_list[2].fittedvalues)
Out[104]:
0       3.087246
1       2.410388
2       2.492053
3       2.290045
4       3.055937
          ...   
2439    0.213341
2440    0.205710
2441    0.326453
2442    0.326350
2443    0.241526
Length: 2444, dtype: float64
In [105]:
df02 = pc_df_to_model.copy()
In [106]:
df02['avg_count'] = np.exp( pc_model_list[2].fittedvalues )
In [107]:
df02.head()
Out[107]:
sess sid actv_grp final_events final_trials PC01 PC02 PC03 PC04 PC05 ... PC69 PC70 PC71 PC72 PC73 PC74 PC75 PC76 PC77 avg_count
0 1 1 Aulaweb 2.0 2.0 -1.343494 0.614381 3.048655 -2.980032 -0.717602 ... 0.007764 -0.042438 0.007726 0.016882 -0.033860 -0.020212 -0.020811 0.002340 0.013947 1.782928
1 1 1 Blank 2.0 2.0 2.264423 -0.252348 3.635242 -1.695365 -0.567017 ... 0.016730 0.010136 -0.022828 -0.049764 -0.025509 -0.011965 -0.003058 0.010145 -0.003176 2.297561
2 1 1 Deeds 2.0 2.0 2.407197 -0.285384 3.514516 -1.835526 -0.700168 ... -0.002833 -0.015106 0.000015 -0.020132 -0.009533 -0.014708 0.001780 0.028581 -0.000724 2.392456
3 1 1 Diagram 2.0 2.0 1.800267 -0.177009 3.836746 -0.226538 0.360890 ... 0.019233 -0.019455 -0.021534 -0.068026 -0.055084 0.009613 0.001835 -0.002256 -0.030537 1.808362
4 1 1 Other 2.0 2.0 2.285621 -0.236315 3.690869 -1.931407 -0.768909 ... -0.003566 0.041341 -0.008955 -0.027152 -0.000594 -0.012622 0.027929 0.006442 -0.031819 2.362641

5 rows × 83 columns
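As a worked check of the inverse link, the `avg_count` value for row 0 can be rebuilt from the coefficient table of pc_model_list[2]. Row 0 has sid 1 and actv_grp Aulaweb, which are the reference levels, so only the intercept and the eight PC terms contribute. The coefficients below are the rounded values printed in the summary, so the result matches the table only to about three decimals.

```python
import math

# Rounded coefficients copied from the pc_model_list[2] summary
intercept = 0.8569
pc_coefs = [0.0564, -0.0306, -0.1173, 0.0250, -0.0447, -0.1237, 0.0997, -0.0274]

# PC01..PC08 scores for row 0 (reference levels of sid and actv_grp, so no dummy terms)
pc_scores = [-1.343494, 0.614381, 3.048655, -2.980032,
             -0.717602, -2.483105, -0.946123, -0.109224]

linear_predictor = intercept + sum(b * x for b, x in zip(pc_coefs, pc_scores))
expected_count = math.exp(linear_predictor)   # inverse of the log link

print(round(linear_predictor, 3), round(expected_count, 3))
```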

Calculate the new auxiliary statistic¶
In [108]:
df02['t'] = ( (df02.final_events - df02.avg_count)**2 - df02.avg_count ) / df02.avg_count
In [109]:
aux_mod = smf.ols( 't ~ avg_count - 1', data = df02).fit()
In [110]:
print( aux_mod.summary() )
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                      t   R-squared (uncentered):                   0.002
Model:                            OLS   Adj. R-squared (uncentered):              0.002
Method:                 Least Squares   F-statistic:                              4.881
Date:                Thu, 27 Apr 2023   Prob (F-statistic):                      0.0272
Time:                        07:52:57   Log-Likelihood:                         -4813.7
No. Observations:                2444   AIC:                                      9629.
Df Residuals:                    2443   BIC:                                      9635.
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
avg_count     -0.0439      0.020     -2.209      0.027      -0.083      -0.005
==============================================================================
Omnibus:                     4162.344   Durbin-Watson:                   0.966
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          6897299.670
Skew:                          11.151   Prob(JB):                         0.00
Kurtosis:                     262.295   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [111]:
my_coefplot( aux_mod )

Model 02 is slightly underdispersed. The auxiliary model's slope is negative (-0.044, 95% CI -0.083 to -0.005), so there is no evidence of overdispersion, and both the slope and its confidence interval are small in magnitude. The Poisson regression assumption that the variance equals the mean is therefore reasonable in this case.
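The auxiliary regression can be written out in a few lines. If the variance has the form mean + alpha * mean**2, then ((y - mu)**2 - mu) / mu has expectation alpha * mu, so a no-intercept regression of that statistic on mu estimates alpha. The sketch below uses the closed-form no-intercept OLS slope; the function name `dispersion_slope` and the toy numbers are illustrative and are not taken from the EPM data.

```python
def dispersion_slope(observed, fitted_mean):
    """No-intercept OLS slope of t on the fitted mean,
    where t = ((y - mu)**2 - mu) / mu."""
    t = [((y - mu) ** 2 - mu) / mu for y, mu in zip(observed, fitted_mean)]
    numerator = sum(mu * ti for mu, ti in zip(fitted_mean, t))
    denominator = sum(mu * mu for mu in fitted_mean)
    return numerator / denominator

# Toy numbers for illustration only
mu = [1.0, 4.0]
print(dispersion_slope([2.0, 6.0], mu))    # squared residuals equal the means
print(dispersion_slope([3.0, 10.0], mu))   # squared residuals exceed the means
```

A slope near zero is consistent with the Poisson assumption, a clearly positive slope points to overdispersion, and a negative slope points to underdispersion.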

Negative Binomial Regression¶

Negative Binomial without PCA¶

In [112]:
dm00_y, dm00_X = dmatrices('final_events ~ sid', data=final_sqrt_df, return_type='dataframe')
dm01_y, dm01_X = dmatrices('final_events ~ sid + actv_grp', data=final_sqrt_df, return_type='dataframe')
dm02_y, dm02_X = dmatrices('final_events ~ sid + actv_grp + ' + num_features_str, data=final_sqrt_df, return_type='dataframe')
dm03_y, dm03_X = dmatrices('final_events ~ ' + num_features_str, data=final_sqrt_df, return_type='dataframe')
dm04_y, dm04_X = dmatrices('final_events ~ (' + num_features_str + ')**2', data=final_sqrt_df, return_type='dataframe')
In [113]:
dm00_X.head()
Out[113]:
Intercept sid[T.2] sid[T.4] sid[T.5] sid[T.7] sid[T.8] sid[T.9] sid[T.11] sid[T.12] sid[T.14] ... sid[T.91] sid[T.92] sid[T.94] sid[T.95] sid[T.99] sid[T.101] sid[T.102] sid[T.103] sid[T.104] sid[T.106]
0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 62 columns

In [114]:
dm00_y.head()
Out[114]:
final_events
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
In [115]:
modNB00 = sm.GLM(dm00_y, dm00_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
modNB01 = sm.GLM(dm01_y, dm01_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
modNB02 = sm.GLM(dm02_y, dm02_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
modNB03 = sm.GLM(dm03_y, dm03_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
In [116]:
modNB_list = [modNB00,modNB01,modNB02,modNB03]
In [117]:
modNB_results = pd.DataFrame({'model_name': ['modNB00','modNB01','modNB02','modNB03'],
                              'AIC': [mod.aic for mod in modNB_list],
                              'BIC': [mod.bic for mod in modNB_list]})
/Users/lisaover/opt/anaconda3/envs/cmpinf2120/lib/python3.8/site-packages/statsmodels/genmod/generalized_linear_model.py:1799: FutureWarning: The bic value is computed using the deviance formula. After 0.13 this will change to the log-likelihood based formula. This change has no impact on the relative rank of models compared using BIC. You can directly access the log-likelihood version using the `bic_llf` attribute. You can suppress this message by calling statsmodels.genmod.generalized_linear_model.SET_USE_BIC_LLF with True to get the LLF-based version now or False to retainthe deviance version.
  warnings.warn(
In [118]:
modNB_results
Out[118]:
model_name AIC BIC
0 modNB00 7622.925542 -16595.505299
1 modNB01 7594.774720 -16565.642208
2 modNB02 7162.815262 -16550.894535
3 modNB03 7480.925401 -16644.683179
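The negative BIC values are the deviance-based BIC that the FutureWarning describes, not the log-likelihood version. The sketch below reproduces both forms for modNB02 from the rounded deviance, residual degrees of freedom, and log-likelihood printed in its summary. The two formulas are written from the warning's description and checked against the table; they are not copied from the statsmodels source, so treat them as an assumption.

```python
import math

n_obs = 2444
deviance = 1353.3          # from the modNB02 summary
df_resid = 2295            # Df Residuals
log_likelihood = -3432.4   # Log-Likelihood
k_params = 149             # Df Model (148) plus the intercept

bic_deviance = deviance - df_resid * math.log(n_obs)
bic_llf = -2 * log_likelihood + k_params * math.log(n_obs)
aic = -2 * log_likelihood + 2 * k_params

print(round(bic_deviance, 1))   # comparable to the BIC column above
print(round(bic_llf, 1))        # log-likelihood form, comparable to the Poisson BIC values
print(round(aic, 1))
```

As the warning suggests, reading each model's `bic_llf` attribute instead of `bic` when building modNB_results would put the BIC column on the same scale as the Poisson results.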
In [119]:
sns.relplot(data = modNB_results.melt(id_vars=['model_name']),
            x='model_name',
            y='value', 
            col='variable',
            col_wrap=2,
            facet_kws = {'sharey': False})

plt.show()
In [120]:
print(modNB02.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           final_events   No. Observations:                 2444
Model:                            GLM   Df Residuals:                     2295
Model Family:        NegativeBinomial   Df Model:                          148
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -3432.4
Date:                Thu, 27 Apr 2023   Deviance:                       1353.3
Time:                        07:52:58   Pearson chi2:                 1.08e+03
No. Iterations:                    24   Pseudo R-squ. (CS):             0.3464
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       0.1358      0.288      0.472      0.637      -0.428       0.699
sid[T.2]                       -1.0080      0.294     -3.430      0.001      -1.584      -0.432
sid[T.4]                       -1.6195      0.336     -4.814      0.000      -2.279      -0.960
sid[T.5]                       -0.6385      0.286     -2.233      0.026      -1.199      -0.078
sid[T.7]                       -0.5035      0.280     -1.798      0.072      -1.052       0.045
sid[T.8]                       -1.8584      0.387     -4.807      0.000      -2.616      -1.101
sid[T.9]                       -1.3715      0.329     -4.167      0.000      -2.017      -0.726
sid[T.11]                      -0.2581      0.277     -0.931      0.352      -0.801       0.285
sid[T.12]                      -1.1180      0.319     -3.505      0.000      -1.743      -0.493
sid[T.14]                      -0.2066      0.276     -0.748      0.454      -0.748       0.335
sid[T.19]                      -0.7361      0.308     -2.389      0.017      -1.340      -0.132
sid[T.20]                       0.1670      0.270      0.618      0.537      -0.363       0.697
sid[T.22]                      -1.7593      0.403     -4.363      0.000      -2.550      -0.969
sid[T.24]                      -1.2135      0.304     -3.993      0.000      -1.809      -0.618
sid[T.25]                      -0.7285      0.351     -2.077      0.038      -1.416      -0.041
sid[T.30]                      -0.5420      0.291     -1.860      0.063      -1.113       0.029
sid[T.33]                     -25.1326   4.14e+04     -0.001      1.000   -8.12e+04    8.11e+04
sid[T.34]                      -1.1691      0.308     -3.798      0.000      -1.772      -0.566
sid[T.37]                      -1.3000      0.361     -3.604      0.000      -2.007      -0.593
sid[T.38]                      -0.9761      0.296     -3.298      0.001      -1.556      -0.396
sid[T.39]                      -0.5863      0.323     -1.815      0.070      -1.219       0.047
sid[T.42]                      -1.8335      0.342     -5.368      0.000      -2.503      -1.164
sid[T.44]                       0.8420      0.319      2.638      0.008       0.216       1.468
sid[T.45]                       0.0189      0.337      0.056      0.955      -0.642       0.680
sid[T.46]                      -1.6314      0.398     -4.096      0.000      -2.412      -0.851
sid[T.47]                      -1.5105      0.333     -4.531      0.000      -2.164      -0.857
sid[T.49]                      -1.4406      0.303     -4.756      0.000      -2.034      -0.847
sid[T.51]                      -1.5684      0.323     -4.853      0.000      -2.202      -0.935
sid[T.52]                      -1.6853      0.323     -5.220      0.000      -2.318      -1.053
sid[T.54]                      -0.9987      0.299     -3.340      0.001      -1.585      -0.413
sid[T.55]                       0.0110      0.303      0.036      0.971      -0.583       0.605
sid[T.56]                      -0.3047      0.285     -1.068      0.286      -0.864       0.255
sid[T.57]                     -24.5318   2.91e+04     -0.001      0.999    -5.7e+04    5.69e+04
sid[T.58]                      -1.0060      0.635     -1.583      0.113      -2.251       0.239
sid[T.59]                      -1.5392      0.327     -4.705      0.000      -2.180      -0.898
sid[T.60]                     -25.1975   2.23e+04     -0.001      0.999   -4.38e+04    4.37e+04
sid[T.61]                      -0.6115      0.303     -2.015      0.044      -1.206      -0.017
sid[T.62]                      -1.0567      0.424     -2.491      0.013      -1.888      -0.225
sid[T.64]                     -25.5006   2.36e+04     -0.001      0.999   -4.64e+04    4.63e+04
sid[T.67]                      -0.3932      0.325     -1.211      0.226      -1.029       0.243
sid[T.68]                      -0.0400      0.279     -0.143      0.886      -0.587       0.507
sid[T.69]                      -0.8165      0.337     -2.425      0.015      -1.476      -0.157
sid[T.70]                      -0.5305      0.303     -1.750      0.080      -1.125       0.064
sid[T.71]                      -0.5808      0.319     -1.818      0.069      -1.207       0.045
sid[T.73]                      -1.2784      0.351     -3.641      0.000      -1.967      -0.590
sid[T.75]                      -0.0387      0.300     -0.129      0.897      -0.627       0.549
sid[T.77]                      -0.6072      0.633     -0.959      0.338      -1.848       0.634
sid[T.79]                      -0.7467      0.294     -2.536      0.011      -1.324      -0.170
sid[T.80]                      -1.2728      0.319     -3.991      0.000      -1.898      -0.648
sid[T.82]                      -2.1508      0.330     -6.511      0.000      -2.798      -1.503
sid[T.83]                      -1.9852      0.346     -5.732      0.000      -2.664      -1.306
sid[T.87]                      -0.5944      0.283     -2.102      0.036      -1.149      -0.040
sid[T.91]                      -1.5260      0.313     -4.871      0.000      -2.140      -0.912
sid[T.92]                      -0.8436      0.306     -2.756      0.006      -1.444      -0.244
sid[T.94]                      -0.4750      0.278     -1.706      0.088      -1.021       0.071
sid[T.95]                      -1.2901      0.318     -4.060      0.000      -1.913      -0.667
sid[T.99]                      -1.6616      0.373     -4.457      0.000      -2.392      -0.931
sid[T.101]                     -1.5112      0.343     -4.403      0.000      -2.184      -0.839
sid[T.102]                     -1.3697      0.337     -4.066      0.000      -2.030      -0.709
sid[T.103]                    -26.0552   2.57e+04     -0.001      0.999   -5.05e+04    5.04e+04
sid[T.104]                     -1.2339      0.445     -2.776      0.006      -2.105      -0.363
sid[T.106]                      0.4919      0.529      0.930      0.352      -0.544       1.528
actv_grp[T.Blank]               0.0502      0.128      0.393      0.694      -0.200       0.301
actv_grp[T.Deeds]               0.0282      0.128      0.221      0.825      -0.222       0.279
actv_grp[T.Diagram]            -0.0721      0.128     -0.565      0.572      -0.322       0.178
actv_grp[T.FSM]                -0.5637      0.319     -1.766      0.077      -1.189       0.062
actv_grp[T.FSM_Related]        -0.5141      0.263     -1.958      0.050      -1.029       0.001
actv_grp[T.Other]               0.0708      0.131      0.539      0.590      -0.187       0.328
actv_grp[T.Properties]         -0.0540      0.125     -0.431      0.666      -0.300       0.192
actv_grp[T.Study]               0.0734      0.130      0.566      0.571      -0.181       0.327
actv_grp[T.Study_Materials]     0.0167      0.296      0.057      0.955      -0.563       0.597
actv_grp[T.TextEditor]         -0.0060      0.128     -0.047      0.963      -0.257       0.245
total_ms_tp000_sqrt             0.0002      0.000      0.838      0.402      -0.000       0.001
mw_tp000_sqrt                   0.0049      0.008      0.624      0.533      -0.011       0.020
mwc_tp000_sqrt                 -0.0453      0.081     -0.557      0.578      -0.205       0.114
mcl_tp000_sqrt                 -0.0160      0.018     -0.894      0.371      -0.051       0.019
mcr_tp000_sqrt                 -0.0188      0.021     -0.896      0.370      -0.060       0.022
mm_tp000_sqrt                   0.0025      0.002      1.284      0.199      -0.001       0.006
ks_tp000_sqrt                  -0.0067      0.006     -1.134      0.257      -0.018       0.005
total_ms_tp010_sqrt             0.0005      0.000      1.406      0.160      -0.000       0.001
mw_tp010_sqrt                   0.0010      0.010      0.094      0.925      -0.019       0.021
mwc_tp010_sqrt                  0.0248      0.102      0.244      0.807      -0.175       0.224
mcl_tp010_sqrt                  0.0293      0.024      1.240      0.215      -0.017       0.076
mcr_tp010_sqrt                 -0.0091      0.025     -0.367      0.713      -0.058       0.039
mm_tp010_sqrt                  -0.0068      0.003     -2.313      0.021      -0.012      -0.001
ks_tp010_sqrt                   0.0030      0.007      0.409      0.683      -0.011       0.017
total_ms_tp020_sqrt             0.0001      0.000      0.302      0.762      -0.001       0.001
mw_tp020_sqrt                  -0.0100      0.014     -0.716      0.474      -0.037       0.017
mwc_tp020_sqrt                  0.0252      0.134      0.188      0.851      -0.238       0.288
mcl_tp020_sqrt                 -0.0418      0.031     -1.366      0.172      -0.102       0.018
mcr_tp020_sqrt                  0.0013      0.032      0.042      0.966      -0.061       0.063
mm_tp020_sqrt                   0.0065      0.004      1.607      0.108      -0.001       0.014
ks_tp020_sqrt                  -0.0090      0.009     -0.953      0.341      -0.028       0.010
total_ms_tp030_sqrt             0.0003      0.001      0.629      0.530      -0.001       0.001
mw_tp030_sqrt                   0.0140      0.016      0.900      0.368      -0.017       0.045
mwc_tp030_sqrt                  0.0117      0.140      0.084      0.933      -0.263       0.287
mcl_tp030_sqrt                  0.0292      0.037      0.786      0.432      -0.044       0.102
mcr_tp030_sqrt                  0.0975      0.040      2.436      0.015       0.019       0.176
mm_tp030_sqrt                  -0.0069      0.005     -1.433      0.152      -0.016       0.003
ks_tp030_sqrt                  -0.0087      0.012     -0.716      0.474      -0.033       0.015
total_ms_tp040_sqrt             0.0007      0.001      1.172      0.241      -0.000       0.002
mw_tp040_sqrt                  -0.0135      0.016     -0.838      0.402      -0.045       0.018
mwc_tp040_sqrt                  0.0211      0.122      0.172      0.863      -0.219       0.261
mcl_tp040_sqrt                 -0.0314      0.041     -0.760      0.447      -0.112       0.050
mcr_tp040_sqrt                  0.0365      0.046      0.797      0.426      -0.053       0.126
mm_tp040_sqrt                   0.0006      0.005      0.104      0.918      -0.010       0.011
ks_tp040_sqrt                  -0.0177      0.015     -1.205      0.228      -0.046       0.011
total_ms_tp050_sqrt             0.0009      0.001      1.408      0.159      -0.000       0.002
mw_tp050_sqrt                   0.0400      0.021      1.929      0.054      -0.001       0.081
mwc_tp050_sqrt                 -0.1063      0.150     -0.708      0.479      -0.401       0.188
mcl_tp050_sqrt                 -0.0281      0.045     -0.631      0.528      -0.115       0.059
mcr_tp050_sqrt                  0.0263      0.047      0.555      0.579      -0.066       0.119
mm_tp050_sqrt                  -0.0034      0.006     -0.577      0.564      -0.015       0.008
ks_tp050_sqrt                   0.0049      0.015      0.326      0.745      -0.024       0.034
total_ms_tp060_sqrt            -0.0008      0.001     -1.059      0.290      -0.002       0.001
mw_tp060_sqrt                  -0.0266      0.024     -1.133      0.257      -0.073       0.019
mwc_tp060_sqrt                  0.2214      0.232      0.955      0.340      -0.233       0.676
mcl_tp060_sqrt                 -0.0343      0.047     -0.727      0.467      -0.127       0.058
mcr_tp060_sqrt                 -0.0076      0.056     -0.134      0.893      -0.118       0.103
mm_tp060_sqrt                   0.0073      0.006      1.153      0.249      -0.005       0.020
ks_tp060_sqrt                   0.0150      0.017      0.876      0.381      -0.019       0.049
total_ms_tp070_sqrt             0.0007      0.001      0.831      0.406      -0.001       0.002
mw_tp070_sqrt                   0.0168      0.023      0.722      0.470      -0.029       0.062
mwc_tp070_sqrt                 -0.0605      0.265     -0.228      0.820      -0.581       0.460
mcl_tp070_sqrt                  0.0406      0.048      0.841      0.400      -0.054       0.135
mcr_tp070_sqrt                 -0.1275      0.062     -2.062      0.039      -0.249      -0.006
mm_tp070_sqrt                  -0.0062      0.006     -0.956      0.339      -0.019       0.007
ks_tp070_sqrt                  -0.0105      0.019     -0.550      0.582      -0.048       0.027
total_ms_tp080_sqrt             0.0004      0.001      0.515      0.607      -0.001       0.002
mw_tp080_sqrt                   0.0076      0.026      0.290      0.772      -0.044       0.059
mwc_tp080_sqrt                 -0.1163      0.310     -0.375      0.708      -0.724       0.491
mcl_tp080_sqrt                 -0.0105      0.052     -0.203      0.839      -0.112       0.091
mcr_tp080_sqrt                 -0.0517      0.071     -0.728      0.466      -0.191       0.087
mm_tp080_sqrt                   0.0014      0.007      0.200      0.842      -0.013       0.015
ks_tp080_sqrt                   0.0071      0.020      0.347      0.728      -0.033       0.047
total_ms_tp090_sqrt            -0.0004      0.001     -0.476      0.634      -0.002       0.001
mw_tp090_sqrt                  -0.0260      0.026     -0.994      0.320      -0.077       0.025
mwc_tp090_sqrt                 -0.1285      0.283     -0.454      0.650      -0.683       0.426
mcl_tp090_sqrt                  0.0290      0.050      0.576      0.565      -0.070       0.128
mcr_tp090_sqrt                 -0.0127      0.079     -0.161      0.872      -0.168       0.142
mm_tp090_sqrt                  -0.0016      0.007     -0.223      0.824      -0.015       0.012
ks_tp090_sqrt                   0.0280      0.020      1.386      0.166      -0.012       0.068
total_ms_tp100_sqrt            -0.0016      0.001     -2.897      0.004      -0.003      -0.001
mw_tp100_sqrt                   0.0060      0.016      0.377      0.706      -0.025       0.037
mwc_tp100_sqrt                  0.1975      0.189      1.045      0.296      -0.173       0.568
mcl_tp100_sqrt                  0.0082      0.033      0.245      0.807      -0.057       0.074
mcr_tp100_sqrt                  0.1447      0.055      2.612      0.009       0.036       0.253
mm_tp100_sqrt                   0.0029      0.005      0.627      0.531      -0.006       0.012
ks_tp100_sqrt                  -0.0007      0.012     -0.057      0.955      -0.025       0.024
===============================================================================================
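The AIC and BIC reported for modNB02 can be recovered from the statistics in this summary, which also shows why the BIC is negative: it is the deviance-based form, not the log-likelihood-based one. The following is a minimal sketch using the rounded values printed above, so small rounding differences are expected; the variable names are illustrative.

```python
import math

# Values read from the modNB02 summary above (rounded as printed).
n_obs = 2444
df_model = 148
df_resid = 2295
log_likelihood = -3432.4
deviance = 1353.3

n_params = df_model + 1  # slope terms plus the intercept

# AIC = 2k - 2 * log-likelihood
aic = 2 * n_params - 2 * log_likelihood

# Deviance-based BIC: deviance - df_resid * ln(n)
bic_deviance = deviance - df_resid * math.log(n_obs)

# Log-likelihood-based BIC: k * ln(n) - 2 * log-likelihood
bic_llf = n_params * math.log(n_obs) - 2 * log_likelihood

print(f"AIC          = {aic:.1f}")
print(f"BIC deviance = {bic_deviance:.1f}")
print(f"BIC llf      = {bic_llf:.1f}")
```

The first two values agree with the modNB_results table to within rounding. The third is on a different scale from the tabulated BIC and is not directly comparable with it.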
In [121]:
my_coefplot(modNB02)
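Because the model uses a log link, exponentiating a coefficient gives a rate ratio relative to the reference level. Several `sid` terms (33, 57, 60, 64, 103) have coefficients near -25 with standard errors in the tens of thousands. That pattern usually arises when a level has only zero counts, although this has not been checked against the data here. The sketch below flags such terms using a handful of rows copied from the summary; the threshold `SE_THRESHOLD` is a hypothetical cut-off chosen for illustration.

```python
import math

# (term, coef, std_err) rows copied from the modNB02 summary above; a small subset.
rows = [
    ("sid[T.2]", -1.0080, 0.294),
    ("sid[T.33]", -25.1326, 4.14e4),
    ("sid[T.57]", -24.5318, 2.91e4),
    ("sid[T.44]", 0.8420, 0.319),
    ("mcr_tp100_sqrt", 0.1447, 0.055),
]

SE_THRESHOLD = 100.0  # hypothetical cut-off for "standard error is not usable"

for term, coef, std_err in rows:
    if std_err > SE_THRESHOLD:
        print(f"{term}: coefficient not well identified (std err {std_err:.3g})")
    else:
        # With a log link, exp(coef) is the multiplicative change in the expected count.
        print(f"{term}: rate ratio = {math.exp(coef):.3f}")

flagged = [term for term, _, std_err in rows if std_err > SE_THRESHOLD]
```

Flagged terms are best read as "expected count close to zero for this student" and not as precise estimates.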

Negative Binomial with PCA

In [122]:
# Candidate formulas on the principal components:
# 00: student only; 01: student + activity group; 02: student + activity group + PCs;
# 03: PCs only; 04: PCs with all pairwise interactions; 05: activity group + PCs;
# 06: activity group interacting with each PC; 07: activity group + pairwise PC interactions.
pc_dm00_y, pc_dm00_X = dmatrices('final_events ~ sid', data=pc_df_to_model, return_type='dataframe')
pc_dm01_y, pc_dm01_X = dmatrices('final_events ~ sid + actv_grp', data=pc_df_to_model, return_type='dataframe')
pc_dm02_y, pc_dm02_X = dmatrices('final_events ~ sid + actv_grp + ' + pc_features_str, data=pc_df_to_model, return_type='dataframe')
pc_dm03_y, pc_dm03_X = dmatrices('final_events ~ ' + pc_features_str, data=pc_df_to_model, return_type='dataframe')
pc_dm04_y, pc_dm04_X = dmatrices('final_events ~ (' + pc_features_str + ')**2', data=pc_df_to_model, return_type='dataframe')
pc_dm05_y, pc_dm05_X = dmatrices('final_events ~ actv_grp + ' + pc_features_str, data=pc_df_to_model, return_type='dataframe')
pc_dm06_y, pc_dm06_X = dmatrices('final_events ~ actv_grp * (' + pc_features_str + ')', data=pc_df_to_model, return_type='dataframe')
pc_dm07_y, pc_dm07_X = dmatrices('final_events ~ actv_grp + (' + pc_features_str + ')**2', data=pc_df_to_model, return_type='dataframe')
In [123]:
# Note: loglike_method and bic_llf are not documented GLM.fit options and appear to have
# no effect with the default IRLS fit; the BIC values reported below are still deviance-based.
pc_modNB00 = sm.GLM(pc_dm00_y, pc_dm00_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB01 = sm.GLM(pc_dm01_y, pc_dm01_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB02 = sm.GLM(pc_dm02_y, pc_dm02_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB03 = sm.GLM(pc_dm03_y, pc_dm03_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB04 = sm.GLM(pc_dm04_y, pc_dm04_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB05 = sm.GLM(pc_dm05_y, pc_dm05_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB06 = sm.GLM(pc_dm06_y, pc_dm06_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB07 = sm.GLM(pc_dm07_y, pc_dm07_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
In [124]:
pc_modNB_list = [pc_modNB00,pc_modNB01,pc_modNB02,pc_modNB03,pc_modNB04,pc_modNB05,pc_modNB06,pc_modNB07]
In [125]:
pc_modNB_results = pd.DataFrame({'model_name': ['pc_modNB00','pc_modNB01','pc_modNB02','pc_modNB03','pc_modNB04','pc_modNB05','pc_modNB06','pc_modNB07'],
                              'AIC': [mod.aic for mod in pc_modNB_list],
                              'BIC': [mod.bic for mod in pc_modNB_list]})
In [126]:
pc_modNB_results
Out[126]:
   model_name          AIC           BIC
0  pc_modNB00  7622.925542 -16595.505299
1  pc_modNB01  7594.774720 -16565.642208
2  pc_modNB02  7203.795463 -16910.210334
3  pc_modNB03  7614.470712 -16911.433869
4  pc_modNB04  7561.898320 -16801.567304
5  pc_modNB05  7604.852347 -16863.038321
6  pc_modNB06  7690.765764 -16313.013598
7  pc_modNB07  7563.870430 -16741.581281
In [127]:
sns.relplot(data=pc_modNB_results.melt(id_vars=['model_name']),
            x='model_name',
            y='value',
            col='variable',
            col_wrap=2,
            facet_kws={'sharey': False},
            height=5, aspect=2)

plt.show()
In [128]:
print(pc_modNB02.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           final_events   No. Observations:                 2444
Model:                            GLM   Df Residuals:                     2364
Model Family:        NegativeBinomial   Df Model:                           79
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -3521.9
Date:                Thu, 27 Apr 2023   Deviance:                       1532.3
Time:                        07:53:00   Pearson chi2:                 1.20e+03
No. Iterations:                    24   Pseudo R-squ. (CS):             0.2967
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       1.0430      0.208      5.004      0.000       0.635       1.452
sid[T.2]                       -0.8497      0.275     -3.090      0.002      -1.389      -0.311
sid[T.4]                       -1.7163      0.308     -5.572      0.000      -2.320      -1.113
sid[T.5]                       -0.2894      0.272     -1.066      0.287      -0.822       0.243
sid[T.7]                       -0.5547      0.265     -2.094      0.036      -1.074      -0.035
sid[T.8]                       -1.9066      0.362     -5.263      0.000      -2.617      -1.197
sid[T.9]                       -1.2008      0.304     -3.952      0.000      -1.796      -0.605
sid[T.11]                      -0.4263      0.267     -1.598      0.110      -0.949       0.097
sid[T.12]                      -1.1423      0.292     -3.911      0.000      -1.715      -0.570
sid[T.14]                      -0.1230      0.265     -0.465      0.642      -0.642       0.396
sid[T.19]                      -0.5337      0.296     -1.802      0.072      -1.114       0.047
sid[T.20]                       0.3042      0.255      1.195      0.232      -0.195       0.803
sid[T.22]                      -1.9202      0.371     -5.170      0.000      -2.648      -1.192
sid[T.24]                      -0.8558      0.285     -3.001      0.003      -1.415      -0.297
sid[T.25]                      -0.9986      0.330     -3.024      0.002      -1.646      -0.351
sid[T.30]                      -0.2159      0.271     -0.796      0.426      -0.747       0.315
sid[T.33]                     -24.9732    4.2e+04     -0.001      1.000   -8.23e+04    8.23e+04
sid[T.34]                      -1.3095      0.289     -4.526      0.000      -1.877      -0.742
sid[T.37]                      -1.0763      0.350     -3.074      0.002      -1.762      -0.390
sid[T.38]                      -0.9661      0.281     -3.442      0.001      -1.516      -0.416
sid[T.39]                      -0.5894      0.286     -2.064      0.039      -1.149      -0.030
sid[T.42]                      -1.8109      0.303     -5.980      0.000      -2.404      -1.217
sid[T.44]                       1.2965      0.279      4.651      0.000       0.750       1.843
sid[T.45]                       0.1783      0.320      0.557      0.577      -0.449       0.806
sid[T.46]                      -1.1605      0.369     -3.145      0.002      -1.884      -0.437
sid[T.47]                      -1.4276      0.307     -4.654      0.000      -2.029      -0.826
sid[T.49]                      -1.2740      0.287     -4.432      0.000      -1.837      -0.711
sid[T.51]                      -1.5121      0.302     -5.008      0.000      -2.104      -0.920
sid[T.52]                      -1.5999      0.306     -5.225      0.000      -2.200      -1.000
sid[T.54]                      -1.1983      0.284     -4.216      0.000      -1.755      -0.641
sid[T.55]                      -0.1298      0.295     -0.440      0.660      -0.709       0.449
sid[T.56]                      -0.2980      0.267     -1.116      0.264      -0.821       0.225
sid[T.57]                     -24.8433   2.93e+04     -0.001      0.999   -5.74e+04    5.73e+04
sid[T.58]                      -0.7900      0.613     -1.289      0.197      -1.991       0.411
sid[T.59]                      -1.5478      0.309     -5.014      0.000      -2.153      -0.943
sid[T.60]                     -25.0779   2.23e+04     -0.001      0.999   -4.38e+04    4.37e+04
sid[T.61]                      -0.6813      0.288     -2.363      0.018      -1.246      -0.116
sid[T.62]                      -0.8847      0.416     -2.128      0.033      -1.700      -0.070
sid[T.64]                     -25.4076   2.39e+04     -0.001      0.999   -4.68e+04    4.67e+04
sid[T.67]                      -0.1744      0.308     -0.566      0.572      -0.779       0.430
sid[T.68]                      -0.1111      0.268     -0.414      0.679      -0.637       0.415
sid[T.69]                      -0.8471      0.317     -2.673      0.008      -1.468      -0.226
sid[T.70]                      -0.6750      0.287     -2.352      0.019      -1.237      -0.113
sid[T.71]                      -0.5233      0.307     -1.705      0.088      -1.125       0.078
sid[T.73]                      -1.0350      0.319     -3.240      0.001      -1.661      -0.409
sid[T.75]                      -0.0514      0.285     -0.180      0.857      -0.610       0.507
sid[T.77]                      -0.8524      0.611     -1.395      0.163      -2.050       0.346
sid[T.79]                      -0.8940      0.272     -3.289      0.001      -1.427      -0.361
sid[T.80]                      -1.2820      0.293     -4.376      0.000      -1.856      -0.708
sid[T.82]                      -1.9009      0.309     -6.150      0.000      -2.507      -1.295
sid[T.83]                      -1.9442      0.320     -6.084      0.000      -2.571      -1.318
sid[T.87]                      -0.5847      0.267     -2.188      0.029      -1.109      -0.061
sid[T.91]                      -1.4830      0.299     -4.967      0.000      -2.068      -0.898
sid[T.92]                      -0.6252      0.282     -2.218      0.027      -1.178      -0.073
sid[T.94]                      -0.4256      0.263     -1.617      0.106      -0.942       0.090
sid[T.95]                      -1.1891      0.296     -4.018      0.000      -1.769      -0.609
sid[T.99]                      -1.6806      0.342     -4.918      0.000      -2.350      -1.011
sid[T.101]                     -1.2669      0.323     -3.928      0.000      -1.899      -0.635
sid[T.102]                     -1.3029      0.311     -4.192      0.000      -1.912      -0.694
sid[T.103]                    -25.7103   2.66e+04     -0.001      0.999   -5.22e+04    5.21e+04
sid[T.104]                     -1.1449      0.419     -2.732      0.006      -1.966      -0.323
sid[T.106]                      0.6389      0.442      1.446      0.148      -0.227       1.505
actv_grp[T.Blank]              -0.0227      0.122     -0.186      0.852      -0.261       0.216
actv_grp[T.Deeds]              -0.0472      0.120     -0.393      0.694      -0.282       0.188
actv_grp[T.Diagram]            -0.2020      0.121     -1.666      0.096      -0.440       0.036
actv_grp[T.FSM]                -0.9596      0.305     -3.144      0.002      -1.558      -0.361
actv_grp[T.FSM_Related]        -0.7790      0.257     -3.033      0.002      -1.282      -0.276
actv_grp[T.Other]               0.0190      0.125      0.152      0.879      -0.226       0.264
actv_grp[T.Properties]         -0.1445      0.121     -1.197      0.231      -0.381       0.092
actv_grp[T.Study]              -0.0045      0.123     -0.036      0.971      -0.246       0.237
actv_grp[T.Study_Materials]    -0.3177      0.288     -1.101      0.271      -0.883       0.248
actv_grp[T.TextEditor]         -0.0649      0.120     -0.540      0.589      -0.300       0.171
PC01                            0.0527      0.006      8.363      0.000       0.040       0.065
PC02                           -0.0298      0.013     -2.283      0.022      -0.055      -0.004
PC03                           -0.1282      0.012    -10.563      0.000      -0.152      -0.104
PC04                            0.0394      0.014      2.824      0.005       0.012       0.067
PC05                           -0.0303      0.015     -2.074      0.038      -0.059      -0.002
PC06                           -0.1435      0.016     -8.893      0.000      -0.175      -0.112
PC07                            0.1002      0.020      5.048      0.000       0.061       0.139
PC08                           -0.0413      0.025     -1.641      0.101      -0.091       0.008
===============================================================================================
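The `Df Model` values in the two negative binomial summaries follow from counting terms, which makes the trade-off between the original features and the principal components easier to see. The sketch below is plain arithmetic on numbers printed in this notebook. Reading the `tp000` to `tp100` suffixes as 11 time points is an assumption based on the variable names.

```python
# Term counts read off the two summary tables (rows per term group).
n_sid_dummies = 61        # sid[T.2] ... sid[T.106]; one student is the reference level
n_actv_grp_dummies = 10   # actv_grp[T.Blank] ... actv_grp[T.TextEditor]
n_sqrt_features = 7 * 11  # 7 activity measures at 11 assumed time points (tp000 ... tp100)
n_pcs = 8                 # PC01 ... PC08

df_model_original = n_sid_dummies + n_actv_grp_dummies + n_sqrt_features
df_model_pca = n_sid_dummies + n_actv_grp_dummies + n_pcs

aic_original = 7162.815262  # modNB02, from the modNB_results table
aic_pca = 7203.795463       # pc_modNB02, from the pc_modNB_results table

print(f"original features:    Df Model = {df_model_original}, AIC = {aic_original:.1f}")
print(f"principal components: Df Model = {df_model_pca}, AIC = {aic_pca:.1f}")
print(f"parameters saved by PCA: {df_model_original - df_model_pca}")
print(f"AIC difference (PCA - original): {aic_pca - aic_original:.1f}")
```

The PCA model uses 69 fewer parameters at a cost of about 41 AIC points, and most of its components are individually significant, unlike most of the square-root features in modNB02.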
In [129]:
my_coefplot(pc_modNB02)

Visualization Grid

In [130]:
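# Refit the additive PCA formula (sid + actv_grp + PCs) as a Poisson regression for the visualization grid.
best_model = smf.poisson(formula='final_events ~ sid + actv_grp + ' + pc_features_str,
                         data=pc_df_to_model).fit(method='ncg')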
Optimization terminated successfully.
         Current function value: 1.325239
         Iterations: 22
         Function evaluations: 24
         Gradient evaluations: 24
         Hessian evaluations: 22
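The "Current function value" printed by the optimizer is consistent with the negative log-likelihood divided by the number of observations, so it can be converted to a log-likelihood and an AIC. The sketch below does this with the printed value and compares the result with the negative binomial fit of the same formula (pc_modNB02). It assumes 2444 observations and 79 model degrees of freedom, as in the summaries in this notebook, and note that the `NegativeBinomial` GLM family was used with its default fixed dispersion, not an estimated one.

```python
# Values printed by the optimizer above and by the model summaries in this notebook.
mean_neg_loglike = 1.325239  # "Current function value"
n_obs = 2444                 # No. Observations
n_params = 79 + 1            # Df Model plus the intercept

# Undo the per-observation scaling to recover the log-likelihood.
log_likelihood = -mean_neg_loglike * n_obs
aic_poisson = 2 * n_params - 2 * log_likelihood

aic_negbin = 7203.795463     # pc_modNB02, same formula with the NegativeBinomial family

print(f"log-likelihood = {log_likelihood:.1f}")
print(f"Poisson AIC    = {aic_poisson:.1f}")
print(f"Poisson AIC is lower by {aic_negbin - aic_poisson:.1f}")
```

The recovered log-likelihood matches the value in the Poisson summary that follows, and the lower AIC is one way to see why the Poisson fit is carried forward as `best_model`.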
In [131]:
print(best_model.summary())
                          Poisson Regression Results                          
==============================================================================
Dep. Variable:           final_events   No. Observations:                 2444
Model:                        Poisson   Df Residuals:                     2364
Method:                           MLE   Df Model:                           79
Date:                Thu, 27 Apr 2023   Pseudo R-squ.:                  0.2280
Time:                        07:53:01   Log-Likelihood:                -3238.9
converged:                       True   LL-Null:                       -4195.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
===============================================================================================
                                  coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       0.8569      0.125      6.841      0.000       0.611       1.102
sid[T.2]                       -0.6452      0.171     -3.766      0.000      -0.981      -0.309
sid[T.4]                       -1.4606      0.199     -7.326      0.000      -1.851      -1.070
sid[T.5]                       -0.1323      0.163     -0.813      0.416      -0.451       0.187
sid[T.7]                       -0.4821      0.157     -3.071      0.002      -0.790      -0.174
sid[T.8]                       -1.5239      0.236     -6.462      0.000      -1.986      -1.062
sid[T.9]                       -0.8776      0.179     -4.896      0.000      -1.229      -0.526
sid[T.11]                      -0.2181      0.161     -1.356      0.175      -0.533       0.097
sid[T.12]                      -0.8716      0.183     -4.765      0.000      -1.230      -0.513
sid[T.14]                      -0.0160      0.154     -0.104      0.917      -0.318       0.286
sid[T.19]                      -0.4075      0.182     -2.245      0.025      -0.763      -0.052
sid[T.20]                       0.3226      0.145      2.221      0.026       0.038       0.607
sid[T.22]                      -1.7517      0.278     -6.308      0.000      -2.296      -1.207
sid[T.24]                      -0.6711      0.170     -3.952      0.000      -1.004      -0.338
sid[T.25]                      -0.7196      0.187     -3.844      0.000      -1.086      -0.353
sid[T.30]                      -0.1450      0.165     -0.880      0.379      -0.468       0.178
sid[T.33]                     -12.0511    119.257     -0.101      0.920    -245.791     221.689
sid[T.34]                      -1.0135      0.178     -5.681      0.000      -1.363      -0.664
sid[T.37]                      -0.9214      0.251     -3.671      0.000      -1.413      -0.430
sid[T.38]                      -0.7803      0.171     -4.565      0.000      -1.115      -0.445
sid[T.39]                      -0.6693      0.170     -3.947      0.000      -1.002      -0.337
sid[T.42]                      -1.4754      0.192     -7.667      0.000      -1.853      -1.098
sid[T.44]                       1.2138      0.161      7.533      0.000       0.898       1.530
sid[T.45]                       0.2702      0.182      1.484      0.138      -0.087       0.627
sid[T.46]                      -1.0982      0.277     -3.962      0.000      -1.641      -0.555
sid[T.47]                      -1.1412      0.195     -5.853      0.000      -1.523      -0.759
sid[T.49]                      -1.1224      0.183     -6.138      0.000      -1.481      -0.764
sid[T.51]                      -1.4247      0.212     -6.708      0.000      -1.841      -1.008
sid[T.52]                      -1.4138      0.205     -6.898      0.000      -1.815      -1.012
sid[T.54]                      -0.9313      0.177     -5.261      0.000      -1.278      -0.584
sid[T.55]                       0.0668      0.170      0.394      0.694      -0.266       0.399
sid[T.56]                      -0.1048      0.154     -0.680      0.496      -0.407       0.197
sid[T.57]                     -12.7017    119.257     -0.107      0.915    -246.442     221.039
sid[T.58]                      -0.6118      0.427     -1.434      0.152      -1.448       0.224
sid[T.59]                      -1.3104      0.212     -6.174      0.000      -1.726      -0.894
sid[T.60]                     -13.4938    119.257     -0.113      0.910    -247.234     220.247
sid[T.61]                      -0.5787      0.178     -3.246      0.001      -0.928      -0.229
sid[T.62]                      -0.6456      0.276     -2.335      0.020      -1.187      -0.104
sid[T.64]                     -13.6254    119.257     -0.114      0.909    -247.366     220.115
sid[T.67]                      -0.1626      0.180     -0.903      0.366      -0.515       0.190
sid[T.68]                       0.0336      0.151      0.222      0.824      -0.263       0.330
sid[T.69]                      -0.7821      0.214     -3.649      0.000      -1.202      -0.362
sid[T.70]                      -0.7380      0.176     -4.188      0.000      -1.083      -0.393
sid[T.71]                      -0.4077      0.197     -2.072      0.038      -0.793      -0.022
sid[T.73]                      -0.8393      0.194     -4.335      0.000      -1.219      -0.460
sid[T.75]                       0.0326      0.161      0.203      0.839      -0.283       0.348
sid[T.77]                      -0.6609      0.426     -1.552      0.121      -1.495       0.174
sid[T.79]                      -0.8551      0.165     -5.192      0.000      -1.178      -0.532
sid[T.80]                      -0.9316      0.180     -5.178      0.000      -1.284      -0.579
sid[T.82]                      -1.6979      0.212     -7.995      0.000      -2.114      -1.282
sid[T.83]                      -1.5917      0.203     -7.853      0.000      -1.989      -1.194
sid[T.87]                      -0.5131      0.160     -3.210      0.001      -0.826      -0.200
sid[T.91]                      -1.1428      0.182     -6.264      0.000      -1.500      -0.785
sid[T.92]                      -0.5063      0.181     -2.804      0.005      -0.860      -0.152
sid[T.94]                      -0.4604      0.155     -2.962      0.003      -0.765      -0.156
sid[T.95]                      -1.0420      0.185     -5.623      0.000      -1.405      -0.679
sid[T.99]                      -1.2638      0.212     -5.957      0.000      -1.680      -0.848
sid[T.101]                     -0.9256      0.197     -4.697      0.000      -1.312      -0.539
sid[T.102]                     -1.2173      0.200     -6.097      0.000      -1.609      -0.826
sid[T.103]                    -13.7455    119.257     -0.115      0.908    -247.486     219.995
sid[T.104]                     -0.8747      0.277     -3.154      0.002      -1.418      -0.331
sid[T.106]                      0.7139      0.210      3.397      0.001       0.302       1.126
actv_grp[T.Blank]               0.0110      0.074      0.149      0.882      -0.134       0.156
actv_grp[T.Deeds]               0.0232      0.072      0.321      0.748      -0.119       0.165
actv_grp[T.Diagram]            -0.1319      0.073     -1.816      0.069      -0.274       0.010
actv_grp[T.FSM]                -0.8886      0.234     -3.805      0.000      -1.346      -0.431
actv_grp[T.FSM_Related]        -0.6528      0.182     -3.578      0.000      -1.011      -0.295
actv_grp[T.Other]               0.0238      0.076      0.313      0.755      -0.125       0.173
actv_grp[T.Properties]         -0.0863      0.073     -1.189      0.235      -0.229       0.056
actv_grp[T.Study]               0.0159      0.075      0.212      0.832      -0.131       0.163
actv_grp[T.Study_Materials]    -0.2142      0.172     -1.245      0.213      -0.551       0.123
actv_grp[T.TextEditor]         -0.0107      0.073     -0.147      0.883      -0.153       0.132
PC01                            0.0564      0.004     15.066      0.000       0.049       0.064
PC02                           -0.0306      0.008     -4.048      0.000      -0.045      -0.016
PC03                           -0.1173      0.008    -15.385      0.000      -0.132      -0.102
PC04                            0.0250      0.008      3.073      0.002       0.009       0.041
PC05                           -0.0447      0.009     -5.207      0.000      -0.062      -0.028
PC06                           -0.1237      0.010    -12.826      0.000      -0.143      -0.105
PC07                            0.0997      0.012      8.191      0.000       0.076       0.124
PC08                           -0.0274      0.015     -1.835      0.066      -0.057       0.002
===============================================================================================
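The coefficients above are on the log scale if the model uses the default log link for Poisson or Negative Binomial regression. That is an assumption here, since the fitting call is not shown in this part of the notebook. Under that assumption, exponentiating a coefficient gives a multiplicative rate ratio for the expected response. The sketch below uses a few values copied from the summary table; the interval bounds are treated as 95% limits, which is the statsmodels default but is not confirmed by the visible output. The rows for sid 33, 57, 60, 64, and 103 have coefficients near -12 to -14 with standard errors of 119.257. Estimates like these commonly appear when a level has no observed events, so they are left out of the sketch.

```python
import math

# (coefficient, lower bound, upper bound) copied from the summary table above
rows = {
    "PC01": (0.0564, 0.049, 0.064),
    "PC03": (-0.1173, -0.132, -0.102),
    "actv_grp[T.FSM]": (-0.8886, -1.346, -0.431),
    "sid[T.44]": (1.2138, 0.898, 1.530),
}


def rate_ratio(coef, lo, hi):
    """Exponentiate a log-scale coefficient and its confidence bounds."""
    return math.exp(coef), math.exp(lo), math.exp(hi)


ratios = {name: rate_ratio(*vals) for name, vals in rows.items()}

for name, (rr, lo, hi) in ratios.items():
    # Assumes a log link and 95% interval bounds (neither is shown in this section)
    print(f"{name:<18} rate ratio {rr:.3f}  (interval {lo:.3f} to {hi:.3f})")
```

Read this way, a one-unit increase in PC01 multiplies the expected response by about 1.06 with the other terms held fixed, while a ratio below 1 (for example actv_grp[T.FSM]) indicates a lower expected response than the reference level.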
In [132]:
my_coefplot(best_model)
In [133]:
pc_df_to_model.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444 entries, 0 to 2443
Data columns (total 82 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sess          2444 non-null   object 
 1   sid           2444 non-null   object 
 2   actv_grp      2444 non-null   object 
 3   final_events  2444 non-null   float64
 4   final_trials  2444 non-null   float64
 5   PC01          2444 non-null   float64
 6   PC02          2444 non-null   float64
 7   PC03          2444 non-null   float64
 8   PC04          2444 non-null   float64
 9   PC05          2444 non-null   float64
 10  PC06          2444 non-null   float64
 11  PC07          2444 non-null   float64
 12  PC08          2444 non-null   float64
 13  PC09          2444 non-null   float64
 14  PC10          2444 non-null   float64
 15  PC11          2444 non-null   float64
 16  PC12          2444 non-null   float64
 17  PC13          2444 non-null   float64
 18  PC14          2444 non-null   float64
 19  PC15          2444 non-null   float64
 20  PC16          2444 non-null   float64
 21  PC17          2444 non-null   float64
 22  PC18          2444 non-null   float64
 23  PC19          2444 non-null   float64
 24  PC20          2444 non-null   float64
 25  PC21          2444 non-null   float64
 26  PC22          2444 non-null   float64
 27  PC23          2444 non-null   float64
 28  PC24          2444 non-null   float64
 29  PC25          2444 non-null   float64
 30  PC26          2444 non-null   float64
 31  PC27          2444 non-null   float64
 32  PC28          2444 non-null   float64
 33  PC29          2444 non-null   float64
 34  PC30          2444 non-null   float64
 35  PC31          2444 non-null   float64
 36  PC32          2444 non-null   float64
 37  PC33          2444 non-null   float64
 38  PC34          2444 non-null   float64
 39  PC35          2444 non-null   float64
 40  PC36          2444 non-null   float64
 41  PC37          2444 non-null   float64
 42  PC38          2444 non-null   float64
 43  PC39          2444 non-null   float64
 44  PC40          2444 non-null   float64
 45  PC41          2444 non-null   float64
 46  PC42          2444 non-null   float64
 47  PC43          2444 non-null   float64
 48  PC44          2444 non-null   float64
 49  PC45          2444 non-null   float64
 50  PC46          2444 non-null   float64
 51  PC47          2444 non-null   float64
 52  PC48          2444 non-null   float64
 53  PC49          2444 non-null   float64
 54  PC50          2444 non-null   float64
 55  PC51          2444 non-null   float64
 56  PC52          2444 non-null   float64
 57  PC53          2444 non-null   float64
 58  PC54          2444 non-null   float64
 59  PC55          2444 non-null   float64
 60  PC56          2444 non-null   float64
 61  PC57          2444 non-null   float64
 62  PC58          2444 non-null   float64
 63  PC59          2444 non-null   float64
 64  PC60          2444 non-null   float64
 65  PC61          2444 non-null   float64
 66  PC62          2444 non-null   float64
 67  PC63          2444 non-null   float64
 68  PC64          2444 non-null   float64
 69  PC65          2444 non-null   float64
 70  PC66          2444 non-null   float64
 71  PC67          2444 non-null   float64
 72  PC68          2444 non-null   float64
 73  PC69          2444 non-null   float64
 74  PC70          2444 non-null   float64
 75  PC71          2444 non-null   float64
 76  PC72          2444 non-null   float64
 77  PC73          2444 non-null   float64
 78  PC74          2444 non-null   float64
 79  PC75          2444 non-null   float64
 80  PC76          2444 non-null   float64
 81  PC77          2444 non-null   float64
dtypes: float64(79), object(3)
memory usage: 1.5+ MB
In [134]:
pc_df_to_model.head()
Out[134]:
   sess  sid  actv_grp  final_events  final_trials       PC01       PC02       PC03       PC04       PC05  ...       PC68       PC69       PC70       PC71       PC72       PC73       PC74       PC75       PC76       PC77
0     1    1   Aulaweb           2.0           2.0  -1.343494   0.614381   3.048655  -2.980032  -0.717602  ...   0.021226   0.007764  -0.042438   0.007726   0.016882  -0.033860  -0.020212  -0.020811   0.002340   0.013947
1     1    1     Blank           2.0           2.0   2.264423  -0.252348   3.635242  -1.695365  -0.567017  ...  -0.054488   0.016730   0.010136  -0.022828  -0.049764  -0.025509  -0.011965  -0.003058   0.010145  -0.003176
2     1    1     Deeds           2.0           2.0   2.407197  -0.285384   3.514516  -1.835526  -0.700168  ...  -0.008969  -0.002833  -0.015106   0.000015  -0.020132  -0.009533  -0.014708   0.001780   0.028581  -0.000724
3     1    1   Diagram           2.0           2.0   1.800267  -0.177009   3.836746  -0.226538   0.360890  ...  -0.052102   0.019233  -0.019455  -0.021534  -0.068026  -0.055084   0.009613   0.001835  -0.002256  -0.030537
4     1    1     Other           2.0           2.0   2.285621  -0.236315   3.690869  -1.931407  -0.768909  ...  -0.031325  -0.003566   0.041341  -0.008955  -0.027152  -0.000594  -0.012622   0.027929   0.006442  -0.031819

5 rows × 82 columns

In [135]:
pc_df_to_model.PC01.min()
Out[135]:
-16.59044209268085
In [136]:
pc_df_to_model.PC01.max()
Out[136]:
25.04659416706056
In [137]:
pc_df_to_model.sid.unique()
Out[137]:
array([1, 2, 4, 5, 7, 9, 11, 12, 14, 19, 20, 22, 30, 34, 37, 38, 39, 42,
       44, 46, 47, 49, 51, 52, 54, 55, 56, 59, 62, 67, 68, 70, 71, 73, 79,
       80, 82, 87, 91, 92, 94, 101, 102, 104, 8, 24, 61, 83, 95, 99, 103,
       25, 45, 69, 75, 106, 33, 57, 58, 60, 64, 77], dtype=object)
In [138]:
# Prediction grid for PC01: 101 evenly spaced values spanning its observed range
# (padded by 0.02), with PC02 through PC08 held at 0, crossed with a subset of
# students and the activity groups in actv_subgrp_1.
pc01_values = np.linspace(pc_df_to_model.PC01.min() - 0.02,
                          pc_df_to_model.PC01.max() + 0.02, num=101)
input_grid_pc01 = pd.DataFrame([(xa, 0., 0., 0., 0., 0., 0., 0., xi, xj)
                                for xa in pc01_values
                                # for xi in pc_df_to_model.sid.unique()
                                for xi in [5, 14, 20, 44, 87, 94, 102, 106]
                                for xj in actv_subgrp_1],
                               columns=['PC01', 'PC02', 'PC03', 'PC04', 'PC05',
                                        'PC06', 'PC07', 'PC08', 'sid', 'actv_grp'])
In [139]:
input_grid_pc01.describe()
Out[139]:
              PC01    PC02    PC03    PC04    PC05    PC06    PC07    PC08          sid
count  4848.000000  4848.0  4848.0  4848.0  4848.0  4848.0  4848.0  4848.0  4848.000000
mean      4.228076     0.0     0.0     0.0     0.0     0.0     0.0     0.0    59.000000
std      12.152093     0.0     0.0     0.0     0.0     0.0     0.0     0.0    39.932179
min     -16.610442     0.0     0.0     0.0     0.0     0.0     0.0     0.0     5.000000
25%      -6.191183     0.0     0.0     0.0     0.0     0.0     0.0     0.0    18.500000
50%       4.228076     0.0     0.0     0.0     0.0     0.0     0.0     0.0    65.500000
75%      14.647335     0.0     0.0     0.0     0.0     0.0     0.0     0.0    96.000000
max      25.066594     0.0     0.0     0.0     0.0     0.0     0.0     0.0   106.000000
In [140]:
input_grid_pc01.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4848 entries, 0 to 4847
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   PC01      4848 non-null   float64
 1   PC02      4848 non-null   float64
 2   PC03      4848 non-null   float64
 3   PC04      4848 non-null   float64
 4   PC05      4848 non-null   float64
 5   PC06      4848 non-null   float64
 6   PC07      4848 non-null   float64
 7   PC08      4848 non-null   float64
 8   sid       4848 non-null   int64  
 9   actv_grp  4848 non-null   object 
dtypes: float64(8), int64(1), object(1)
memory usage: 378.9+ KB
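The 4848 rows reported above are consistent with a full cross product of 101 PC01 values, 8 students, and 6 activity groups (101 × 8 × 6 = 4848). The group count of 6 is inferred from that arithmetic, since the contents of actv_subgrp_1 are not shown here. As a minimal sketch, the same grid shape can be reproduced with itertools.product; the group labels below are hypothetical placeholders, and the PC01 endpoints are the min and max values printed earlier.

```python
import itertools

import numpy as np
import pandas as pd

# PC01 endpoints copied from the min/max outputs above, padded by 0.02
pc01_values = np.linspace(-16.59044209268085 - 0.02,
                          25.04659416706056 + 0.02, num=101)
sids = [5, 14, 20, 44, 87, 94, 102, 106]
# Hypothetical stand-in for actv_subgrp_1 (six labels inferred from the row count)
groups = [f"group_{i}" for i in range(6)]

pc_columns = [f"PC{i:02d}" for i in range(1, 9)]

# One row per (PC01 value, student, group); PC02 through PC08 are fixed at 0
grid = pd.DataFrame(
    [(pc01, *([0.0] * 7), sid, grp)
     for pc01, sid, grp in itertools.product(pc01_values, sids, groups)],
    columns=pc_columns + ["sid", "actv_grp"],
)

print(grid.shape)
print(grid[["PC01", "sid"]].describe())
```

Because itertools.product varies its last argument fastest, the row order matches a nested comprehension that loops over PC01, then student, then activity group.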
In [141]:
# Model-predicted mean response for every row of the grid (response scale, not log scale)
input_grid_pc01['pred_probability'] = best_model.predict(input_grid_pc01)
In [142]:
sns.relplot(data = input_grid_pc01, x='PC01', y='pred_probability', 
            hue='actv_grp', kind='line')

plt.show()
In [143]:
sns.relplot(data = input_grid_pc01, x='PC01', y='pred_probability', hue='actv_grp', 
            col='sid', col_wrap=2, kind='line')

plt.show()
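Because the model is purely additive, the curves in these plots should differ only by constant multiplicative factors if the link is the log link (assumed here, as is usual for Poisson and Negative Binomial regression). The following sketch illustrates this with coefficients copied from the summary table. It omits the intercept and the activity-group terms, which are not needed for comparing students, so the values are relative rather than actual predictions. sid 5 is omitted because its coefficient is not shown in this part of the output.

```python
import numpy as np

b_pc01 = 0.0564  # PC01 coefficient from the summary table
# Student coefficients from the summary table (relative to the reference student)
sid_coef = {14: -0.0160, 20: 0.3226, 44: 1.2138, 87: -0.5131,
            94: -0.4604, 102: -1.2173, 106: 0.7139}

# Approximate PC01 range used for the prediction grid
pc01 = np.linspace(-16.61, 25.07, num=101)

# Relative expected response under a log link: exp(partial linear predictor).
# The intercept and activity-group terms are left out on purpose.
relative = {sid: np.exp(b_pc01 * pc01 + coef) for sid, coef in sid_coef.items()}

# The ratio between two students is the same at every PC01 value
ratio_44_vs_102 = relative[44] / relative[102]
print(ratio_44_vs_102.min(), ratio_44_vs_102.max())
```

Under these assumptions the ratio between students 44 and 102 equals exp(1.2138 + 1.2173), roughly 11.4, at every value of PC01, which is why the per-student panels show the same curve shape at different heights.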